linux-kernel - Re: [PATCH v3] support madvise(MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20140408233004.GD22728@bbox>
Date:	Wed, 9 Apr 2014 08:30:04 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Zhang Yanfei <zhangyanfei@...fujitsu.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Dave Hansen <dave.hansen@...el.com>,
	John Stultz <john.stultz@...aro.org>,
	Hugh Dickins <hughd@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Rik van Riel <riel@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>, Jason Evans <je@...com>
Subject: Re: [PATCH v3] support madvise(MADV_FREE)

Hello Zhang,

On Tue, Apr 08, 2014 at 01:35:25PM +0800, Zhang Yanfei wrote:
> Hello Minchan,
> 
> Nice work. It seems I forgot to reply your mail in the first RFC version
> of the patch, so I just review this new one.
> 
> On 04/07/2014 10:51 AM, Minchan Kim wrote:
> > Linux doesn't have an ability to free pages lazy while other OS
> > already have been supported that named by madvise(MADV_FREE).
> > 
> > The gain is clear that kernel can discard freed pages rather than
> > swapping out or OOM if memory pressure happens.
> > 
> > Without memory pressure, freed pages would be reused by userspace
> > without another additional overhead(ex, page fault + allocation
> > + zeroing).
> 
> Just from this, I cannot see how you handle the 'lazyfree' pages and
> how to distinguish 'lazyfree' and 'nonlazyfree' pages. Anyway, I've
> known these from the patch itself, so I know them, but it is difficult
> for someone who wants to get the knowledge from the log. So could you
> add a little explanation into this log to help understanding the code,
> especially you have put all the changes into one patch?

Okay.

How about this?


======= &< =======

Linux doesn't have an ability to free pages lazy while other OS
already have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation
+ zeroing).

How to work is following as.

When madvise syscall is called, VM clears dirty bit of ptes of
the range. If memory pressure happens, VM checks dirty bit of
page table and if it found still "clean", it means it's a
"lazyfree pages" so VM could discard the page instead of swapping out.
Once there was store operation for the page before VM peek a page
to reclaim, dirty bit is set so VM can swap out the page instead of
discarding.

Firstly, heavy users would be general allocators(ex, jemalloc,
tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
have supported the feature for other OS(ex, FreeBSD)

...
...
...

> 
> > 
> > Firstly, heavy users would be general allocators(ex, jemalloc,
> > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> > have supported the feature for other OS(ex, FreeBSD)
> > 
> > barrios@...ptop:~/benchmark/ebizzy$ lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                4
> > On-line CPU(s) list:   0-3
> > Thread(s) per core:    2
> > Core(s) per socket:    2
> > Socket(s):             1
> > NUMA node(s):          1
> > Vendor ID:             GenuineIntel
> > CPU family:            6
> > Model:                 42
> > Stepping:              7
> > CPU MHz:               2801.000
> > BogoMIPS:              5581.64
> > Virtualization:        VT-x
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              4096K
> > NUMA node0 CPU(s):     0-3
> > 
> > ebizzy benchmark(./ebizzy -S 10 -n 512)
> > 
> >  vanilla-jemalloc		MADV_free-jemalloc
> > 
> > 1 thread
> > records:  10              records:  10
> > avg:      7436.70         avg:      15292.70
> > std:      48.01(0.65%)    std:      496.40(3.25%)
> > max:      7542.00         max:      15944.00
> > min:      7366.00         min:      14478.00
> > 
> > 2 thread
> > records:  10              records:  10
> > avg:      12190.50        avg:      24975.50
> > std:      1011.51(8.30%)  std:      1127.22(4.51%)
> > max:      13012.00        max:      26382.00
> > min:      10192.00        min:      23265.00
> > 
> > 4 thread
> > records:  10              records:  10
> > avg:      16875.30        avg:      36320.90
> > std:      562.59(3.33%)   std:      1503.75(4.14%)
> > max:      17465.00        max:      38314.00
> > min:      15552.00        min:      33863.00
> > 
> > 8 thread
> > records:  10              records:  10
> > avg:      16966.80        avg:      35915.20
> > std:      229.35(1.35%)   std:      2153.89(6.00%)
> > max:      17456.00        max:      37943.00
> > min:      16742.00        min:      29891.00
> > 
> > 16 thread
> > records:  10              records:  10
> > avg:      20590.90        avg:      37388.40
> > std:      362.33(1.76%)   std:      1282.59(3.43%)
> > max:      20954.00        max:      38911.00
> > min:      19985.00        min:      34928.00
> > 
> > 32 thread
> > records:  10              records:  10
> > avg:      22633.40        avg:      37118.00
> > std:      413.73(1.83%)   std:      766.36(2.06%)
> > max:      23120.00        max:      38328.00
> > min:      22071.00        min:      35557.00
> > 
> > In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> > 
> > Patchset is based on 3.14
> > 
> > * From v2
> >  * Remove forceful dirty marking of swap-readed page - Johannes
> >  * Remove deactivation logic of lazyfreed page
> >  * Rebased on 3.14
> >  * Remove RFC tag
> > 
> > * From v1
> >  * Use custom page table walker for madvise_free - Johannes
> >  * Remove PG_lazypage flag - Johannes
> >  * Do madvise_dontneed instead of madvise_freein swapless system
> > 
> > Cc: Hugh Dickins <hughd@...gle.com>
> > Cc: Johannes Weiner <hannes@...xchg.org>
> > Cc: Rik van Riel <riel@...hat.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
> > Cc: Mel Gorman <mgorman@...e.de>
> > Cc: Jason Evans <je@...com>
> > Signed-off-by: Minchan Kim <minchan@...nel.org>
> > ---
> >  include/linux/mm.h                     |   2 +
> >  include/linux/rmap.h                   |  14 +++-
> >  include/linux/vm_event_item.h          |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/madvise.c                           |  25 ++++++
> >  mm/memory.c                            | 141 +++++++++++++++++++++++++++++++++
> >  mm/rmap.c                              |  62 +++++++++++++--
> >  mm/vmscan.c                            |  30 ++++++-
> >  mm/vmstat.c                            |   1 +
> >  9 files changed, 267 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c1b7414c7bef..79af90212c19 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1063,6 +1063,8 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> >  		unsigned long size);
> >  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
> >  		unsigned long size, struct zap_details *);
> > +void lazyfree_range(struct vm_area_struct *vma, unsigned long address,
> > +		unsigned long size);
> >  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >  		unsigned long start, unsigned long end);
> >  
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index b66c2110cb1f..1f3517027ffa 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -182,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
> >   * Called from mm/vmscan.c to handle paging out
> >   */
> >  int page_referenced(struct page *, int is_locked,
> > -			struct mem_cgroup *memcg, unsigned long *vm_flags);
> > +			struct mem_cgroup *memcg, unsigned long *vm_flags,
> > +			int *is_dirty);
> >  int page_referenced_one(struct page *, struct vm_area_struct *,
> >  	unsigned long address, void *arg);
> >  
> > @@ -235,6 +236,11 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page);
> >  void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
> >  int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
> >  
> > +struct rmap_private {
> > +	enum ttu_flags flags;
> > +	int pte_dirty;	/* used for lazyfree */
> > +};
> > +
> >  /*
> >   * rmap_walk_control: To control rmap traversing for specific needs
> >   *
> > @@ -265,9 +271,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
> >  
> >  static inline int page_referenced(struct page *page, int is_locked,
> >  				  struct mem_cgroup *memcg,
> > -				  unsigned long *vm_flags)
> > +				  unsigned long *vm_flags,
> > +				  int *is_pte_dirty)
> >  {
> >  	*vm_flags = 0;
> > +	if (is_pte_dirty)
> > +		*is_pte_dirty = 0;
> >  	return 0;
> >  }
> >  
> > @@ -288,5 +297,6 @@ static inline int page_mkclean(struct page *page)
> >  #define SWAP_AGAIN	1
> >  #define SWAP_FAIL	2
> >  #define SWAP_MLOCK	3
> > +#define SWAP_DISCARD	4
> >  
> >  #endif	/* _LINUX_RMAP_H */
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 3a712e2e7d76..a69680d335bb 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		FOR_ALL_ZONES(PGALLOC),
> >  		PGFREE, PGACTIVATE, PGDEACTIVATE,
> >  		PGFAULT, PGMAJFAULT,
> > +		PGLAZYFREED,
> >  		FOR_ALL_ZONES(PGREFILL),
> >  		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> >  		FOR_ALL_ZONES(PGSTEAL_DIRECT),
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 4164529a94f9..b7eedd33ebab 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -34,6 +34,7 @@
> >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> > +#define MADV_FREE	5		/* free pages only if memory pressure */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_REMOVE	9		/* remove these pages & resources */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 539eeb96b323..4a2aa96b2db3 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -31,6 +31,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_FREE:
> >  		return 0;
> >  	default:
> >  		/* be safe, default to 1. list exceptions explicitly */
> > @@ -251,6 +252,22 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static long madvise_lazyfree(struct vm_area_struct *vma,
> > +			     struct vm_area_struct **prev,
> > +			     unsigned long start, unsigned long end)
> > +{
> > +	*prev = vma;
> > +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > +		return -EINVAL;
> > +
> > +	/* MADV_FREE works for only anon vma at the moment */
> > +	if (vma->vm_file)
> > +		return -EINVAL;
> > +
> > +	lazyfree_range(vma, start, end - start);
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Application no longer needs these pages.  If the pages are dirty,
> >   * it's OK to just throw them away.  The app will be more careful about
> > @@ -384,6 +401,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_remove(vma, prev, start, end);
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> > +	case MADV_FREE:
> > +		/*
> > +		 * In this implementation, MADV_FREE works like MADV_DONTNEED
> > +		 * on swapless system or full swap.
> > +		 */
> > +		if (get_nr_swap_pages() > 0)
> > +			return madvise_lazyfree(vma, prev, start, end);
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed(vma, prev, start, end);
> >  	default:
> > @@ -403,6 +427,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_FREE:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22dfa617bddb..7811faa8bbe3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1268,6 +1268,105 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> >  	return addr;
> >  }
> >  
> > +static unsigned long lazyfree_pte_range(struct mmu_gather *tlb,
> > +				struct vm_area_struct *vma, pmd_t *pmd,
> > +				unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_struct *mm = tlb->mm;
> > +	spinlock_t *ptl;
> > +	pte_t *start_pte;
> > +	pte_t *pte;
> > +
> > +	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +	pte = start_pte;
> > +	arch_enter_lazy_mmu_mode();
> > +	do {
> > +		pte_t ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		ptent = pte_mkold(ptent);
> > +		ptent = pte_mkclean(ptent);
> > +		set_pte_at(mm, addr, pte, ptent);
> > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > +	} while (pte++, addr += PAGE_SIZE, addr != end);
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(start_pte, ptl);
> > +
> > +	return addr;
> > +}
> > +
> > +static inline unsigned long lazyfree_pmd_range(struct mmu_gather *tlb,
> > +				struct vm_area_struct *vma, pud_t *pud,
> > +				unsigned long addr, unsigned long end)
> > +{
> > +	pmd_t *pmd;
> > +	unsigned long next;
> > +
> > +	pmd = pmd_offset(pud, addr);
> > +	do {
> > +		next = pmd_addr_end(addr, end);
> > +		/* TODO : optimize */
> > +		if (pmd_trans_huge(*pmd))
> > +			split_huge_page_pmd(vma, addr, pmd);
> > +		/*
> > +		 * Here there can be other concurrent MADV_DONTNEED or
> > +		 * trans huge page faults running, and if the pmd is
> > +		 * none or trans huge it can change under us. This is
> > +		 * because MADV_LAZYFREE holds the mmap_sem in read
> > +		 * mode.
> > +		 */
> > +		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> > +			goto next;
> > +		next = lazyfree_pte_range(tlb, vma, pmd, addr, next);
> > +next:
> > +		cond_resched();
> > +	} while (pmd++, addr = next, addr != end);
> > +
> > +	return addr;
> > +}
> > +
> > +static inline unsigned long lazyfree_pud_range(struct mmu_gather *tlb,
> > +				struct vm_area_struct *vma, pgd_t *pgd,
> > +				unsigned long addr, unsigned long end)
> > +{
> > +	pud_t *pud;
> > +	unsigned long next;
> > +
> > +	pud = pud_offset(pgd, addr);
> > +	do {
> > +		next = pud_addr_end(addr, end);
> > +		if (pud_none_or_clear_bad(pud))
> > +			continue;
> > +		next = lazyfree_pmd_range(tlb, vma, pud, addr, next);
> > +	} while (pud++, addr = next, addr != end);
> > +
> > +	return addr;
> > +}
> > +
> > +static void lazyfree_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	pgd_t *pgd;
> > +	unsigned long next;
> > +
> > +	BUG_ON(addr >= end);
> > +	tlb_start_vma(tlb, vma);
> > +	pgd = pgd_offset(vma->vm_mm, addr);
> > +	do {
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none_or_clear_bad(pgd))
> > +			continue;
> > +		next = lazyfree_pud_range(tlb, vma, pgd, addr, next);
> > +	} while (pgd++, addr = next, addr != end);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> >  static void unmap_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> >  			     unsigned long addr, unsigned long end,
> > @@ -1294,6 +1393,23 @@ static void unmap_page_range(struct mmu_gather *tlb,
> >  }
> >  
> >  
> > +static void lazyfree_single_vma(struct mmu_gather *tlb,
> > +		struct vm_area_struct *vma, unsigned long start_addr,
> > +		unsigned long end_addr)
> > +{
> > +	unsigned long start = max(vma->vm_start, start_addr);
> > +	unsigned long end;
> > +
> > +	if (start >= vma->vm_end)
> > +		return;
> > +	end = min(vma->vm_end, end_addr);
> > +	if (end <= vma->vm_start)
> > +		return;
> > +
> > +	if (start != end)
> > +		lazyfree_page_range(tlb, vma, start, end);
> > +}
> > +
> >  static void unmap_single_vma(struct mmu_gather *tlb,
> >  		struct vm_area_struct *vma, unsigned long start_addr,
> >  		unsigned long end_addr,
> > @@ -1368,6 +1484,31 @@ void unmap_vmas(struct mmu_gather *tlb,
> >  }
> >  
> >  /**
> > + * lazyfree_range - clear dirty bit of pte in a given range
> > + * @vma: vm_area_struct holding the applicable pages
> > + * @start: starting address of pages
> > + * @size: number of bytes to do lazyfree
> > + *
> > + * Caller must protect the VMA list
> > + */
> > +void lazyfree_range(struct vm_area_struct *vma, unsigned long start,
> > +		unsigned long size)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +	unsigned long end = start + size;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start, end);
> > +	update_hiwater_rss(mm);
> > +	mmu_notifier_invalidate_range_start(mm, start, end);
> > +	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> > +		lazyfree_single_vma(&tlb, vma, start, end);
> > +	mmu_notifier_invalidate_range_end(mm, start, end);
> > +	tlb_finish_mmu(&tlb, start, end);
> > +}
> > +
> > +/**
> >   * zap_page_range - remove user pages in a given range
> >   * @vma: vm_area_struct holding the applicable pages
> >   * @start: starting address of pages to zap
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 8fc049f9a5a6..ca0ebd28edfa 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -661,6 +661,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
> >  }
> >  
> >  struct page_referenced_arg {
> > +	int dirtied;
> >  	int mapcount;
> >  	int referenced;
> >  	unsigned long vm_flags;
> > @@ -675,6 +676,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	spinlock_t *ptl;
> >  	int referenced = 0;
> > +	int dirty = 0;
> >  	struct page_referenced_arg *pra = arg;
> >  
> >  	if (unlikely(PageTransHuge(page))) {
> > @@ -727,6 +729,10 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> >  			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
> >  				referenced++;
> >  		}
> > +
> > +		if (pte_dirty(*pte))
> > +			dirty++;
> > +
> >  		pte_unmap_unlock(pte, ptl);
> >  	}
> >  
> > @@ -735,6 +741,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> >  		pra->vm_flags |= vma->vm_flags;
> >  	}
> >  
> > +	if (dirty)
> > +		pra->dirtied++;
> > +
> >  	pra->mapcount--;
> >  	if (!pra->mapcount)
> >  		return SWAP_SUCCESS; /* To break the loop */
> > @@ -759,6 +768,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
> >   * @is_locked: caller holds lock on the page
> >   * @memcg: target memory cgroup
> >   * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
> > + * @is_pte_dirty: ptes which have marked dirty bit
> >   *
> >   * Quick test_and_clear_referenced for all mappings to a page,
> >   * returns the number of ptes which referenced the page.
> > @@ -766,7 +776,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
> >  int page_referenced(struct page *page,
> >  		    int is_locked,
> >  		    struct mem_cgroup *memcg,
> > -		    unsigned long *vm_flags)
> > +		    unsigned long *vm_flags,
> > +		    int *is_pte_dirty)
> >  {
> >  	int ret;
> >  	int we_locked = 0;
> > @@ -781,6 +792,9 @@ int page_referenced(struct page *page,
> >  	};
> >  
> >  	*vm_flags = 0;
> > +	if (is_pte_dirty)
> > +		*is_pte_dirty = 0;
> > +
> >  	if (!page_mapped(page))
> >  		return 0;
> >  
> > @@ -808,6 +822,9 @@ int page_referenced(struct page *page,
> >  	if (we_locked)
> >  		unlock_page(page);
> >  
> > +	if (is_pte_dirty)
> > +		*is_pte_dirty = pra.dirtied;
> > +
> >  	return pra.referenced;
> >  }
> >  
> > @@ -1120,7 +1137,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  	pte_t pteval;
> >  	spinlock_t *ptl;
> >  	int ret = SWAP_AGAIN;
> > -	enum ttu_flags flags = (enum ttu_flags)arg;
> > +	struct rmap_private *rp = (struct rmap_private *)arg;
> > +	enum ttu_flags flags = rp->flags;
> > +	int dirty = 0;
> >  
> >  	pte = page_check_address(page, mm, address, &ptl, 0);
> >  	if (!pte)
> > @@ -1150,7 +1169,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  	pteval = ptep_clear_flush(vma, address, pte);
> >  
> >  	/* Move the dirty bit to the physical page now the pte is gone. */
> > -	if (pte_dirty(pteval))
> > +	dirty = pte_dirty(pteval);
> > +	if (dirty)
> >  		set_page_dirty(page);
> >  
> >  	/* Update high watermark before we lower rss */
> > @@ -1169,6 +1189,15 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  		swp_entry_t entry = { .val = page_private(page) };
> >  		pte_t swp_pte;
> >  
> > +		if ((TTU_ACTION(flags) == TTU_UNMAP) && !dirty &&
> > +				!PageSwapCache(page) && !PageDirty(page)) {
> 
> I see this 
>       !dirty && !PageSwapCache(page) && !PageDirty(page)
> test in several places, could be placed into a inline function with a
> appropriate name?

Okay.

bool page_discardable(struct page *page, int pte_dirty);

> 
> 
> > +			dec_mm_counter(mm, MM_ANONPAGES);
> > +			goto discard;
> > +		}
> > +
> > +		if (dirty)
> > +			rp->pte_dirty++;
> > +
> >  		if (PageSwapCache(page)) {
> >  			/*
> >  			 * Store the swap location in the pte.
> > @@ -1187,6 +1216,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			}
> >  			dec_mm_counter(mm, MM_ANONPAGES);
> >  			inc_mm_counter(mm, MM_SWAPENTS);
> > +		} else if (TTU_ACTION(flags) == TTU_UNMAP) {
> > +			set_pte_at(mm, address, pte, pteval);
> > +			ret = SWAP_FAIL;
> > +			goto out_unmap;
> 
> Can you elaborate this chage a bit?

At a memoent, I'd like to work it for reclaim path, not migration.


> 
> >  		} else if (IS_ENABLED(CONFIG_MIGRATION)) {
> >  			/*
> >  			 * Store the pfn of the page in a special migration
> > @@ -1210,6 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  	} else
> >  		dec_mm_counter(mm, MM_FILEPAGES);
> >  
> > +discard:
> >  	page_remove_rmap(page);
> >  	page_cache_release(page);
> >  
> > @@ -1470,13 +1504,19 @@ static int page_not_mapped(struct page *page)
> >   * SWAP_AGAIN	- we missed a mapping, try again later
> >   * SWAP_FAIL	- the page is unswappable
> >   * SWAP_MLOCK	- page is mlocked.
> > + * SWAP_DISCARD - same with SWAP_SUCCESS but no need to swap out
> >   */
> >  int try_to_unmap(struct page *page, enum ttu_flags flags)
> >  {
> >  	int ret;
> > +
> > +	struct rmap_private rp = {
> > +		.flags = flags,
> > +	};
> > +
> >  	struct rmap_walk_control rwc = {
> >  		.rmap_one = try_to_unmap_one,
> > -		.arg = (void *)flags,
> > +		.arg = &rp,
> >  		.done = page_not_mapped,
> >  		.file_nonlinear = try_to_unmap_nonlinear,
> >  		.anon_lock = page_lock_anon_vma_read,
> > @@ -1497,8 +1537,14 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> >  
> >  	ret = rmap_walk(page, &rwc);
> >  
> > -	if (ret != SWAP_MLOCK && !page_mapped(page))
> > +	if (ret != SWAP_MLOCK && !page_mapped(page)) {
> >  		ret = SWAP_SUCCESS;
> > +		if (TTU_ACTION(flags) == TTU_UNMAP && !rp.pte_dirty &&
> > +			PageAnon(page) && !PageSwapCache(page) &&
> > +			!PageDirty(page) && !PageKsm(page))
> > +			ret = SWAP_DISCARD;
> > +	}
> > +
> >  	return ret;
> >  }
> >  
> > @@ -1520,9 +1566,13 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> >  int try_to_munlock(struct page *page)
> >  {
> >  	int ret;
> > +	struct rmap_private rp = {
> > +		.flags = TTU_MUNLOCK,
> > +	};
> > +
> >  	struct rmap_walk_control rwc = {
> >  		.rmap_one = try_to_unmap_one,
> > -		.arg = (void *)TTU_MUNLOCK,
> > +		.arg = &rp,
> >  		.done = page_not_mapped,
> >  		/*
> >  		 * We don't bother to try to find the munlocked page in
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a9c74b409681..7f1c5a26bc41 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -684,6 +684,7 @@ enum page_references {
> >  	PAGEREF_RECLAIM_CLEAN,
> >  	PAGEREF_KEEP,
> >  	PAGEREF_ACTIVATE,
> > +	PAGEREF_DISCARD,
> >  };
> >  
> >  static enum page_references page_check_references(struct page *page,
> > @@ -691,9 +692,12 @@ static enum page_references page_check_references(struct page *page,
> >  {
> >  	int referenced_ptes, referenced_page;
> >  	unsigned long vm_flags;
> > +	int is_pte_dirty;
> > +
> > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  
> >  	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
> > -					  &vm_flags);
> > +					  &vm_flags, &is_pte_dirty);
> >  	referenced_page = TestClearPageReferenced(page);
> >  
> >  	/*
> > @@ -734,6 +738,18 @@ static enum page_references page_check_references(struct page *page,
> >  		return PAGEREF_KEEP;
> >  	}
> >  
> > +	/*
> > +	 * We should check PageDirty because swap-in page by read fault
> > +	 * will be swapcache and pte point out the page doesn't have
> > +	 * dirty bit so only pte dirtiness check isn't enough. In this case,
> > +	 * it would be good to check PG_swapcache to filter it out.
> > +	 * If the page is removed from swapcache, it must have PG_dirty
> > +	 * so we should check it to prevent purging non-lazyfree page.
> > +	 */
> 
> Nice explanation. So this is the key point to know how to detect a lazyfree page,
> I think something like this can be put in the patch log, too.

Hmm, it's too detail so I wrote general concept above.
Please look at that.

THanks for the review, Zhang.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/