linux-kernel - Re: [PATCH 4/5] mm: simplify reclaim path for MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <EDCE64A3-D874-4FE3-91B5-DE5E26A452F5@gmail.com>
Date:	Tue, 27 Oct 2015 11:44:09 +0800
From:	yalin wang <yalin.wang2010@...il.com>
To:	Hugh Dickins <hughd@...gle.com>
Cc:	Minchan Kim <minchan@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
	lkml <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Michal Hocko <mhocko@...e.cz>,
	Johannes Weiner <hannes@...xchg.org>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH 4/5] mm: simplify reclaim path for MADV_FREE


> On Oct 27, 2015, at 10:09, Hugh Dickins <hughd@...gle.com> wrote:
> 
> On Mon, 19 Oct 2015, Minchan Kim wrote:
> 
>> I made reclaim path mess to check and free MADV_FREEed page.
>> This patch simplify it with tweaking add_to_swap.
>> 
>> So far, we mark page as PG_dirty when we add the page into
>> swap cache(ie, add_to_swap) to page out to swap device but
>> this patch moves PG_dirty marking under try_to_unmap_one
>> when we decide to change pte from anon to swapent so if
>> any process's pte has swapent for the page, the page must
>> be swapped out. IOW, there should be no funcional behavior
>> change. It makes relcaim path really simple for MADV_FREE
>> because we just need to check PG_dirty of page to decide
>> discarding the page or not.
>> 
>> Other thing this patch does is to pass TTU_BATCH_FLUSH to
>> try_to_unmap when we handle freeable page because I don't
>> see any reason to prevent it.
>> 
>> Cc: Hugh Dickins <hughd@...gle.com>
>> Cc: Mel Gorman <mgorman@...e.de>
>> Signed-off-by: Minchan Kim <minchan@...nel.org>
> 
> Acked-by: Hugh Dickins <hughd@...gle.com>
> 
> This is sooooooo much nicer than the code it replaces!  Really good.
> Kudos also to Hannes for suggesting this approach originally, I think.
> 
> I hope this implementation satisfies a good proportion of the people
> who have been wanting MADV_FREE: I'm not among them, and have long
> lost touch with those discussions, so won't judge how usable it is.
> 
> I assume you'll refactor the series again before it goes to Linus,
> so the previous messier implementations vanish?  I notice Andrew
> has this "mm: simplify reclaim path for MADV_FREE" in mmotm as
> mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch:
> I guess it all got much too messy to divide up in a hurry.
> 
> I've noticed no problems in testing (unlike the first time you moved
> to working with pte_dirty); though of course I've not been using
> MADV_FREE itself at all.
> 
> One aspect has worried me for a while, but I think I've reached the
> conclusion that it doesn't matter at all.  The swap that's allocated
> in add_to_swap() would normally get freed again (after try_to_unmap
> found it was a MADV_FREE !pte_dirty !PageDirty case) at the bottom
> of shrink_page_list(), in __remove_mapping(), yes?
> 
> The bit that worried me is that on rare occasions, something unknown
> might take a speculative reference to the page, and __remove_mapping()
> fail to freeze refs for that reason.  Much too rare to worry over not
> freeing that page immediately, but it leaves us with a PageUptodate
> PageSwapCache !PageDirty page, yet its contents are not the contents
> of that location on swap.
> 
> But since this can only happen when you have *not* inserted the
> corresponding swapent anywhere, I cannot think of anything that would
> have a legitimate interest in its contents matching that location on swap.
> So I don't think it's worth looking for somewhere to add a SetPageDirty
> (or a delete_from_swap_cache) just to regularize that case.
> 
>> ---
>> include/linux/rmap.h |  6 +----
>> mm/huge_memory.c     |  5 ----
>> mm/rmap.c            | 42 ++++++----------------------------
>> mm/swap_state.c      |  5 ++--
>> mm/vmscan.c          | 64 ++++++++++++++++------------------------------------
>> 5 files changed, 30 insertions(+), 92 deletions(-)
>> 
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 6b6233fafb53..978f65066fd5 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -193,8 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
>>  * Called from mm/vmscan.c to handle paging out
>>  */
>> int page_referenced(struct page *, int is_locked,
>> -			struct mem_cgroup *memcg, unsigned long *vm_flags,
>> -			int *is_pte_dirty);
>> +			struct mem_cgroup *memcg, unsigned long *vm_flags);
>> 
>> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>> 
>> @@ -272,11 +271,8 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
>> static inline int page_referenced(struct page *page, int is_locked,
>> 				  struct mem_cgroup *memcg,
>> 				  unsigned long *vm_flags,
>> -				  int *is_pte_dirty)
>> {
>> 	*vm_flags = 0;
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = 0;
>> 	return 0;
>> }
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 269ed99493f0..adccfb48ce57 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1753,11 +1753,6 @@ pmd_t *page_check_address_pmd(struct page *page,
>> 	return NULL;
>> }
>> 
>> -int pmd_freeable(pmd_t pmd)
>> -{
>> -	return !pmd_dirty(pmd);
>> -}
>> -
>> #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>> 
>> int hugepage_madvise(struct vm_area_struct *vma,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 94ee372e238b..fd64f79c87c4 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -797,7 +797,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>> }
>> 
>> struct page_referenced_arg {
>> -	int dirtied;
>> 	int mapcount;
>> 	int referenced;
>> 	unsigned long vm_flags;
>> @@ -812,7 +811,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 	struct mm_struct *mm = vma->vm_mm;
>> 	spinlock_t *ptl;
>> 	int referenced = 0;
>> -	int dirty = 0;
>> 	struct page_referenced_arg *pra = arg;
>> 
>> 	if (unlikely(PageTransHuge(page))) {
>> @@ -835,14 +833,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>> 			referenced++;
>> 
>> -		/*
>> -		 * Use pmd_freeable instead of raw pmd_dirty because in some
>> -		 * of architecture, pmd_dirty is not defined unless
>> -		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
>> -		 */
>> -		if (!pmd_freeable(*pmd))
>> -			dirty++;
>> -
>> 		spin_unlock(ptl);
>> 	} else {
>> 		pte_t *pte;
>> @@ -873,9 +863,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 				referenced++;
>> 		}
>> 
>> -		if (pte_dirty(*pte))
>> -			dirty++;
>> -
>> 		pte_unmap_unlock(pte, ptl);
>> 	}
>> 
>> @@ -889,9 +876,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>> 		pra->vm_flags |= vma->vm_flags;
>> 	}
>> 
>> -	if (dirty)
>> -		pra->dirtied++;
>> -
>> 	pra->mapcount--;
>> 	if (!pra->mapcount)
>> 		return SWAP_SUCCESS; /* To break the loop */
>> @@ -916,7 +900,6 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>>  * @is_locked: caller holds lock on the page
>>  * @memcg: target memory cgroup
>>  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
>> - * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
>>  *
>>  * Quick test_and_clear_referenced for all mappings to a page,
>>  * returns the number of ptes which referenced the page.
>> @@ -924,8 +907,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>> int page_referenced(struct page *page,
>> 		    int is_locked,
>> 		    struct mem_cgroup *memcg,
>> -		    unsigned long *vm_flags,
>> -		    int *is_pte_dirty)
>> +		    unsigned long *vm_flags)
>> {
>> 	int ret;
>> 	int we_locked = 0;
>> @@ -940,8 +922,6 @@ int page_referenced(struct page *page,
>> 	};
>> 
>> 	*vm_flags = 0;
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = 0;
>> 
>> 	if (!page_mapped(page))
>> 		return 0;
>> @@ -970,9 +950,6 @@ int page_referenced(struct page *page,
>> 	if (we_locked)
>> 		unlock_page(page);
>> 
>> -	if (is_pte_dirty)
>> -		*is_pte_dirty = pra.dirtied;
>> -
>> 	return pra.referenced;
>> }
>> 
>> @@ -1453,17 +1430,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 		swp_entry_t entry = { .val = page_private(page) };
>> 		pte_t swp_pte;
>> 
>> -		if (flags & TTU_FREE) {
>> -			VM_BUG_ON_PAGE(PageSwapCache(page), page);
>> -			if (!PageDirty(page)) {
>> -				/* It's a freeable page by MADV_FREE */
>> -				dec_mm_counter(mm, MM_ANONPAGES);
>> -				goto discard;
>> -			} else {
>> -				set_pte_at(mm, address, pte, pteval);
>> -				ret = SWAP_FAIL;
>> -				goto out_unmap;
>> -			}
>> +		if (!PageDirty(page) && (flags & TTU_FREE)) {
>> +			/* It's a freeable page by MADV_FREE */
>> +			dec_mm_counter(mm, MM_ANONPAGES);
>> +			goto discard;
>> 		}
>> 
>> 		if (PageSwapCache(page)) {
>> @@ -1476,6 +1446,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> 				ret = SWAP_FAIL;
>> 				goto out_unmap;
>> 			}
>> +			if (!PageDirty(page))
>> +				SetPageDirty(page);
>> 			if (list_empty(&mm->mmlist)) {
>> 				spin_lock(&mmlist_lock);
>> 				if (list_empty(&mm->mmlist))
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index d783872d746c..676ff2991380 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>> 	 * deadlock in the swap out path.
>> 	 */
>> 	/*
>> -	 * Add it to the swap cache and mark it dirty
>> +	 * Add it to the swap cache.
>> 	 */
>> 	err = add_to_swap_cache(page, entry,
>> 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
>> 
>> -	if (!err) {	/* Success */
>> -		SetPageDirty(page);
>> +	if (!err) {
>> 		return 1;
>> 	} else {	/* -ENOMEM radix-tree allocation failure */
>> 		/*
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 27d580b5e853..9b52ecf91194 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -791,17 +791,15 @@ enum page_references {
>> };
>> 
>> static enum page_references page_check_references(struct page *page,
>> -						  struct scan_control *sc,
>> -						  bool *freeable)
>> +						  struct scan_control *sc)
>> {
>> 	int referenced_ptes, referenced_page;
>> 	unsigned long vm_flags;
>> -	int pte_dirty;
>> 
>> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> 
>> 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
>> -					  &vm_flags, &pte_dirty);
>> +					  &vm_flags);
>> 	referenced_page = TestClearPageReferenced(page);
>> 
>> 	/*
>> @@ -842,10 +840,6 @@ static enum page_references page_check_references(struct page *page,
>> 		return PAGEREF_KEEP;
>> 	}
>> 
>> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
>> -			!PageDirty(page))
>> -		*freeable = true;
>> -
>> 	/* Reclaim if clean, defer dirty pages to writeback */
>> 	if (referenced_page && !PageSwapBacked(page))
>> 		return PAGEREF_RECLAIM_CLEAN;
>> @@ -1037,8 +1031,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		}
>> 
>> 		if (!force_reclaim)
>> -			references = page_check_references(page, sc,
>> -							&freeable);
>> +			references = page_check_references(page, sc);
>> 
>> 		switch (references) {
>> 		case PAGEREF_ACTIVATE:
>> @@ -1055,31 +1048,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		 * Try to allocate it some swap space here.
>> 		 */
>> 		if (PageAnon(page) && !PageSwapCache(page)) {
>> -			if (!freeable) {
>> -				if (!(sc->gfp_mask & __GFP_IO))
>> -					goto keep_locked;
>> -				if (!add_to_swap(page, page_list))
>> -					goto activate_locked;
>> -				may_enter_fs = 1;
>> -				/* Adding to swap updated mapping */
>> -				mapping = page_mapping(page);
>> -			} else {
>> -				if (likely(!PageTransHuge(page)))
>> -					goto unmap;
>> -				/* try_to_unmap isn't aware of THP page */
>> -				if (unlikely(split_huge_page_to_list(page,
>> -								page_list)))
>> -					goto keep_locked;
>> -			}
>> +			if (!(sc->gfp_mask & __GFP_IO))
>> +				goto keep_locked;
>> +			if (!add_to_swap(page, page_list))
>> +				goto activate_locked;
>> +			freeable = true;
>> +			may_enter_fs = 1;
>> +			/* Adding to swap updated mapping */
>> +			mapping = page_mapping(page);
>> 		}
>> -unmap:
>> +
>> 		/*
>> 		 * The page is mapped into the page tables of one or more
>> 		 * processes. Try to unmap it here.
>> 		 */
>> -		if (page_mapped(page) && (mapping || freeable)) {
>> +		if (page_mapped(page) && mapping) {
>> 			switch (try_to_unmap(page, freeable ?
>> -					TTU_FREE : ttu_flags|TTU_BATCH_FLUSH)) {
>> +					ttu_flags | TTU_BATCH_FLUSH | TTU_FREE :
>> +					ttu_flags | TTU_BATCH_FLUSH)) {
>> 			case SWAP_FAIL:
>> 				goto activate_locked;
>> 			case SWAP_AGAIN:
>> @@ -1087,20 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 			case SWAP_MLOCK:
>> 				goto cull_mlocked;
>> 			case SWAP_SUCCESS:
>> -				/* try to free the page below */
>> -				if (!freeable)
>> -					break;
>> -				/*
>> -				 * Freeable anon page doesn't have mapping
>> -				 * due to skipping of swapcache so we free
>> -				 * page in here rather than __remove_mapping.
>> -				 */
>> -				VM_BUG_ON_PAGE(PageSwapCache(page), page);
>> -				if (!page_freeze_refs(page, 1))
>> -					goto keep_locked;
>> -				__ClearPageLocked(page);
>> -				count_vm_event(PGLAZYFREED);
>> -				goto free_it;
>> +				; /* try to free the page below */
>> 			}
>> 		}
>> 
>> @@ -1217,6 +1190,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		 */
>> 		__ClearPageLocked(page);
>> free_it:
>> +		if (freeable && !PageDirty(page))
>> +			count_vm_event(PGLAZYFREED);
>> +
>> 		nr_reclaimed++;
>> 
>> 		/*
>> @@ -1847,7 +1823,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>> 		}
>> 
>> 		if (page_referenced(page, 0, sc->target_mem_cgroup,
>> -				    &vm_flags, NULL)) {
>> +				    &vm_flags)) {
>> 			nr_rotated += hpage_nr_pages(page);
>> 			/*
>> 			 * Identify referenced, file-backed active pages and
>> -- 
>> 1.9.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
it is wrong here if you only check PageDirty() to decide if the page is freezable or not .
The Anon page are shared by multiple process, _mapcount > 1 ,
so you must check all pt_dirty bit during page_referenced() function,
see this mail thread:
http://ns1.ske-art.com/lists/kernel/msg1934021.html
Thanks









--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/