linux-kernel - Re: [RFC] mm:prototype for the updated swapoff implementation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <530CF9B8.1020008@surriel.com>
Date:	Tue, 25 Feb 2014 15:14:48 -0500
From:	Rik van Riel <riel@...riel.com>
To:	Kelley Nielsen <kelleynnn@...il.com>
CC:	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	opw-kernel@...glegroups.com, jamieliu@...gle.com,
	sjenning@...ux.vnet.ibm.com
Subject: Re: [RFC] mm:prototype for the updated swapoff implementation

On 02/18/2014 07:35 PM, Kelley Nielsen wrote:
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially iterating
> over all the page tables for all the processes in the system for each
> one.
> 
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Nice work.  After reading some related code, I think I may have found
one of the issues you are running into...

> TODO
> 
> * Find and correct where swap entries are being left behind
> * Probably related: handle case of remaining reference in try_to_unuse

It looks like the way you iterate over the mmlist may not be safe.

I believe you need the mm / prev_mm juggling, to make sure that you
never call mmput on an mm before dereferencing mm->mmlist.next.

Otherwise, the process holding the mm can exit during swapoff, and
swapoff will be the last user of the mm. Calling mmput will free
the mm_struct, after which the memory can be re-used for something
else.

In the best case, this can lead to swapoff continuing the scan at
another spot in the mmlist. At the worst case, the code can follow
a pointer to la-la land.

> * Remove find_next_to_unuse, and the call to it in try_to_unuse,
>   when the previous item has been resolved
> * Handle the failure of swapin_readahead in unuse_pte_range

When swapin_readahead fails to allocate memory, the swapoff operation
can be aborted.

When it runs into a swap slot that is no longer in use, swapoff can
continue.

> * make sure unuse_pte_range is handling multiple ptes in the best way

I would not worry about that for now. The reduction from quadratic
to linear complexity should be quite a performance boost :)

> * clean up after failure of unuse_pte in unuse_pte_range
> * Determine the proper place for the mmlist locks in try_to_unuse

The mmlist lock needs to be held until after mm->mmlist.next has
been dereferenced.

> * Handle count of unused pages for frontswap

This can probably be deferred till later.

> * Determine what kind of housekeeping shmem_unuse needs
> * Tighten up the access control for all the various data structures
>   (for instance, the mutex on shmem_swaplist is held throughout the
>   entire process, which is probably not only unneccesary but problematic)
> * Prevent radix entries with zero indices from being passed to
>   shmem_unuse_inode_index

Can you pass only exceptional radix tree entries?

> * Decide if shmem_unuse_inode* should be combined into one function

Probably :)

> * Find cases in which the errors returned from shmem_getpage_gfp can be
>   gracefully handled in shmem_unuse_inode_index, instead of just failing
> * determine when old comments and housekeeping are no longer needed
>   (there are still some to serve as reminders of the housekeeping that
>    needs to be accounted for)

It probably makes sense to not try to do everything at once, but send
in incrementally improved patches.

> ---
>  include/linux/shmem_fs.h |   2 +-
>  mm/shmem.c               | 146 +++++++-------------
>  mm/swapfile.c            | 352 ++++++++++++++++++++---------------------------
>  3 files changed, 206 insertions(+), 294 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 9d55438..af78151 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -55,7 +55,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
>  extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  					pgoff_t index, gfp_t gfp_mask);
>  extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
> -extern int shmem_unuse(swp_entry_t entry, struct page *page);
> +extern int shmem_unuse(unsigned int type);
>  
>  static inline struct page *shmem_read_mapping_page(
>  				struct address_space *mapping, pgoff_t index)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1f18c9d..802456e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -650,127 +650,87 @@ static void shmem_evict_inode(struct inode *inode)
>  /*
>   * If swap found in inode, free it and move page from swapcache to filecache.
>   */
> -static int shmem_unuse_inode(struct shmem_inode_info *info,
> -			     swp_entry_t swap, struct page **pagep)
> +/* TODO Since there's hardly anything left of this function
> + * now that the things shmem_getpage_gfp does have been removed,
> + * just incorporate its actions into shmem_unuse_inode?
> + */
> +static int shmem_unuse_inode_index(struct shmem_inode_info *info,
> +			     pgoff_t index)
>  {
>  	struct address_space *mapping = info->vfs_inode.i_mapping;
> -	void *radswap;
> -	pgoff_t index;
> +	struct page *pagep;
>  	gfp_t gfp;
>  	int error = 0;
>  
> -	radswap = swp_to_radix_entry(swap);
> -	index = radix_tree_locate_item(&mapping->page_tree, radswap);
> -	if (index == -1)
> -		return 0;
> -
> -	/*
> -	 * Move _head_ to start search for next from here.
> -	 * But be careful: shmem_evict_inode checks list_empty without taking
> -	 * mutex, and there's an instant in list_move_tail when info->swaplist
> -	 * would appear empty, if it were the only one on shmem_swaplist.
> +	gfp = mapping_gfp_mask(mapping);
> +	error = shmem_getpage_gfp(&info->vfs_inode, index, &pagep, SGP_CACHE,
> +			gfp, NULL);
> +	/* TODO: go through all the possible error returns
> +	 * in shmem_getpage_gfp, and determine whether
> +	 * we need to fail, or whether we can gracefully recover.
> +	 * (for instance, if the page was swapped in from somewhere
> +	 * else in the kernel between the start of swapoff and now,
> +	 * and can be safely let go.)
> +	 * For now, send failure up the call chain for all errors.
>  	 */
> -	if (shmem_swaplist.next != &info->swaplist)
> -		list_move_tail(&shmem_swaplist, &info->swaplist);
> +	return error;
> +}
>  
> -	gfp = mapping_gfp_mask(mapping);
> -	if (shmem_should_replace_page(*pagep, gfp)) {
> -		mutex_unlock(&shmem_swaplist_mutex);
> -		error = shmem_replace_page(pagep, gfp, info, index);
> -		mutex_lock(&shmem_swaplist_mutex);
> -		/*
> -		 * We needed to drop mutex to make that restrictive page
> -		 * allocation, but the inode might have been freed while we
> -		 * dropped it: although a racing shmem_evict_inode() cannot
> -		 * complete without emptying the radix_tree, our page lock
> -		 * on this swapcache page is not enough to prevent that -
> -		 * free_swap_and_cache() of our swap entry will only
> -		 * trylock_page(), removing swap from radix_tree whatever.
> -		 *
> -		 * We must not proceed to shmem_add_to_page_cache() if the
> -		 * inode has been freed, but of course we cannot rely on
> -		 * inode or mapping or info to check that.  However, we can
> -		 * safely check if our swap entry is still in use (and here
> -		 * it can't have got reused for another page): if it's still
> -		 * in use, then the inode cannot have been freed yet, and we
> -		 * can safely proceed (if it's no longer in use, that tells
> -		 * nothing about the inode, but we don't need to unuse swap).
> -		 */
> -		if (!page_swapcount(*pagep))
> -			error = -ENOENT;
> -	}
> +/* TODO some pages with a null index are slipping through
> + * and being passed to shmem_unuse_inode_index
> + */
> +static int shmem_unuse_inode(struct shmem_inode_info *info, unsigned int type){
> +	struct address_space *mapping = info->vfs_inode.i_mapping;
> +	void **slot;
> +	struct radix_tree_iter iter;
> +	int error = 0;
>  
> -	/*
> -	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
> -	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
> -	 * beneath us (pagelock doesn't help until the page is in pagecache).
> -	 */
> -	if (!error)
> -		error = shmem_add_to_page_cache(*pagep, mapping, index,
> -						GFP_NOWAIT, radswap);
> -	if (error != -ENOMEM) {
> -		/*
> -		 * Truncation and eviction use free_swap_and_cache(), which
> -		 * only does trylock page: if we raced, best clean up here.
> -		 */
> -		delete_from_swap_cache(*pagep);
> -		set_page_dirty(*pagep);
> -		if (!error) {
> -			spin_lock(&info->lock);
> -			info->swapped--;
> -			spin_unlock(&info->lock);
> -			swap_free(swap);
> +	rcu_read_lock();
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0){
> +		struct page *page;
> +		pgoff_t index;
> +		swp_entry_t entry;
> +		unsigned int stype;
> +
> +		index = iter.index;
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exceptional_entry(page)) {
> +			entry = radix_to_swp_entry(page);
> +			stype = swp_type(entry);
> +			if (stype == type){
> +				error = shmem_unuse_inode_index(info, index);
> +			}
>  		}
> -		error = 1;	/* not an error, but entry was found */
> +		if (error)
> +		break;
>  	}
> +	rcu_read_unlock();
>  	return error;
>  }
>  
> -/*
> - * Search through swapped inodes to find and replace swap by page.
> +/* unuse all the shared memory swap entries that
> + * have backing store in the designated swap type.
>   */
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	struct list_head *this, *next;
>  	struct shmem_inode_info *info;
> -	int found = 0;
>  	int error = 0;
>  
> -	/*
> -	 * There's a faint possibility that swap page was replaced before
> -	 * caller locked it: caller will come back later with the right page.
> -	 */
> -	if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
> -		goto out;
> -
> -	/*
> -	 * Charge page using GFP_KERNEL while we can wait, before taking
> -	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
> -	 * Charged back to the user (not to caller) when swap account is used.
> -	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> -	if (error)
> -		goto out;
> -	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> -
>  	mutex_lock(&shmem_swaplist_mutex);
>  	list_for_each_safe(this, next, &shmem_swaplist) {
>  		info = list_entry(this, struct shmem_inode_info, swaplist);
>  		if (info->swapped)
> -			found = shmem_unuse_inode(info, swap, &page);
> +			error = shmem_unuse_inode(info, type);
>  		else
>  			list_del_init(&info->swaplist);
>  		cond_resched();
> -		if (found)
> +		if (error)
>  			break;
>  	}
>  	mutex_unlock(&shmem_swaplist_mutex);
> -
> -	if (found < 0)
> -		error = found;
> -out:
> -	unlock_page(page);
> -	page_cache_release(page);
>  	return error;
>  }
>  
> @@ -2873,7 +2833,7 @@ int __init shmem_init(void)
>  	return 0;
>  }
>  
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	return 0;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4a7f7e6..b69e319 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -68,6 +68,9 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
>  /* Activity counter to indicate that a swapon or swapoff has occurred */
>  static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
> +/* count instances of unuse_pte for the changelog */
> +static long unusepte_calls = 0;
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>  	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
> @@ -1167,13 +1170,18 @@ out_nolock:
>  
>  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
> -	pte_t swp_pte = swp_entry_to_pte(entry);
> +	struct page * page;
> +	swp_entry_t entry;
> +	unsigned int found_type;
>  	pte_t *pte;
>  	int ret = 0;
>  
> +	unusepte_calls++;
> +
>  	/*
> +	 * TODO comment left from original:
>  	 * We don't actually need pte lock while scanning for swp_pte: since
>  	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
>  	 * page table while we're scanning; though it could get zapped, and on
> @@ -1184,16 +1192,71 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	 */
>  	pte = pte_offset_map(pmd, addr);
>  	do {
> +		if (is_swap_pte(*pte)){
> +			entry = pte_to_swp_entry(*pte);
> +			found_type = swp_type(entry);
> +		}
> +		else {
> +			continue;
> +		}
> +		if (found_type == type){
> +			entry = pte_to_swp_entry(*pte);
> +			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +					vma, addr);
> +			if (!page){
> +				/* TODO not sure yet what to do here, or
> +				 * how great the chance of the page
> +				 * not existing actually is.
> +				 * There is a comment in try_to_unuse
> +				 * about the page possibly being freed
> +				 * independently, etc
> +				 */
> +				printk("unuse_pte tried to swap in an invalid page\n");
> +				continue;
> +			}
>  		/*
> -		 * swapoff spends a _lot_ of time in this loop!
> -		 * Test inline before going to call unuse_pte.
> +		 * Wait for and lock page.  When do_swap_page races with
> +		 * try_to_unuse, do_swap_page can handle the fault much
> +		 * faster than try_to_unuse can locate the entry.  This
> +		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> +		 * defer to do_swap_page in such a case - in some tests,
> +		 * do_swap_page and try_to_unuse repeatedly compete.
>  		 */
> -		if (unlikely(maybe_same_pte(*pte, swp_pte))) {
> +			wait_on_page_locked(page);
> +			wait_on_page_writeback(page);
> +			lock_page(page);
> +			wait_on_page_writeback(page);
>  			pte_unmap(pte);
>  			ret = unuse_pte(vma, pmd, addr, entry, page);
> -			if (ret)
> -				goto out;
> -			pte = pte_offset_map(pmd, addr);
> +		/* TODO fix
> +		 * in the new way, we unuse
> +		 * all ptes in the range or fail before returning.
> +		 * For now, leave the return from unuse_pte as is,
> +		 * move on and unuse the next pte.
> +		 */
> +		if (ret < 1){
> +			/* TODO for now, we're just returning
> +			 * the error if unuse_pte fails.
> +			 * we need to clean up the allocated page,
> +			 * plus all the rest of the mess
> +			 */
> +			unlock_page(page);
> +			goto out;
> +		}
> +		/*
> +		 * TODO moved here from try_to_unuse--still relevant?:
> +		 * It is conceivable that a racing task removed this page from
> +		 * swap cache just before we acquired the page lock at the top,
> +		 * or while we dropped it in unuse_mm().  The page might even
> +		 * be back in swap cache on another swap area: that we must not
> +		 * delete, since it may not have been written out to swap yet.
> +		 */
> +		if (PageSwapCache(page) &&
> +		    likely(page_private(page) == entry.val))
> +			delete_from_swap_cache(page);
> +		SetPageDirty(page);
> +		unlock_page(page);
> +		page_cache_release(page);
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap(pte - 1);
> @@ -1203,7 +1266,7 @@ out:
>  
>  static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -1214,8 +1277,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>  			continue;
> -		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pte_range(vma, pmd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
> @@ -1223,7 +1286,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  
>  static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -1234,67 +1297,52 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud))
>  			continue;
> -		ret = unuse_pmd_range(vma, pud, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pmd_range(vma, pud, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_vma(struct vm_area_struct *vma,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_vma(struct vm_area_struct *vma, unsigned int type)
>  {
>  	pgd_t *pgd;
>  	unsigned long addr, end, next;
>  	int ret;
>  
> -	if (page_anon_vma(page)) {
> -		addr = page_address_in_vma(page, vma);
> -		if (addr == -EFAULT)
> -			return 0;
> -		else
> -			end = addr + PAGE_SIZE;
> -	} else {
> -		addr = vma->vm_start;
> -		end = vma->vm_end;
> -	}
> +	addr = vma->vm_start;
> +	end = vma->vm_end;
>  
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd))
>  			continue;
> -		ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pud_range(vma, pgd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pgd++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_mm(struct mm_struct *mm,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_mm(struct mm_struct *mm, unsigned int type)
>  {
>  	struct vm_area_struct *vma;
>  	int ret = 0;
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> -		/*
> -		 * Activate page so shrink_inactive_list is unlikely to unmap
> -		 * its ptes while lock is dropped, so swapoff can make progress.
> -		 */
> -		activate_page(page);
> -		unlock_page(page);
> -		down_read(&mm->mmap_sem);
> -		lock_page(page);
> -	}
> +	down_read(&mm->mmap_sem);
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -		if (vma->anon_vma && (ret = unuse_vma(vma, entry, page)))
> +		if (vma->anon_vma && (ret = unuse_vma(vma, type)))
>  			break;
>  	}
>  	up_read(&mm->mmap_sem);
>  	return (ret < 0)? ret: 0;
>  }
>  
> +/* TODO: this whole function is no longer necessary 
> + * useful for checking that the swap area is clean,
> + * so leaving until these changes are submitted
> + */
>  /*
>   * Scan swap_map (or frontswap_map if frontswap parameter is true)
>   * from current position to next entry still in use.
> @@ -1341,29 +1389,31 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
>  }
>  
>  /*
> - * We completely avoid races by reading each swap page in advance,
> - * and then search for the process using it.  All the necessary
> - * page table adjustments can then be made atomically.
> - *
>   * if the boolean frontswap is true, only unuse pages_to_unuse pages;
>   * pages_to_unuse==0 means all pages; ignored if frontswap is false
>   */
>  int try_to_unuse(unsigned int type, bool frontswap,
>  		 unsigned long pages_to_unuse)
>  {
> -	struct swap_info_struct *si = swap_info[type];
>  	struct mm_struct *start_mm;
> -	volatile unsigned char *swap_map; /* swap_map is accessed without
> -					   * locking. Mark it as volatile
> -					   * to prevent compiler doing
> -					   * something odd.
> -					   */
> -	unsigned char swcount;
> -	struct page *page;
> -	swp_entry_t entry;
> -	unsigned int i = 0;
> +	struct mm_struct *mm;
> +	struct list_head *p;
>  	int retval = 0;
>  
> +	/* TODO for checking if any entries are left
> +	 * after swapoff finishes
> +	 * for debug purposes, remove before submitting */
> +	struct swap_info_struct *si = swap_info[type];
> +	int i = 0;
> +
> +	/* TODO shmem_unuse needs its housekeeping
> +	 * exactly what needs to be done is not yet
> +	 * determined
> +	 */
> +	retval = shmem_unuse(type);
> +	if (retval)
> +		goto out;
> +
>  	/*
>  	 * When searching mms for an entry, a good strategy is to
>  	 * start at the first mm we freed the previous entry from
> @@ -1380,48 +1430,17 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  	 */
>  	start_mm = &init_mm;
>  	atomic_inc(&init_mm.mm_users);
> +	p = &start_mm->mmlist;
>  
> -	/*
> -	 * Keep on scanning until all entries have gone.  Usually,
> -	 * one pass through swap_map is enough, but not necessarily:
> -	 * there are races when an instance of an entry might be missed.
> +	/* TODO: why do we protect the mmlist? (noob QUESTION)
> +	 * Where should the locks actually go?
>  	 */
> -	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
> +	spin_lock(&mmlist_lock);
> +	while (!retval && (p = p->next) != &start_mm->mmlist) {
>  		if (signal_pending(current)) {
>  			retval = -EINTR;
>  			break;
>  		}
> -
> -		/*
> -		 * Get a page for the entry, using the existing swap
> -		 * cache page if there is one.  Otherwise, get a clean
> -		 * page and read the swap into it.
> -		 */
> -		swap_map = &si->swap_map[i];
> -		entry = swp_entry(type, i);
> -		page = read_swap_cache_async(entry,
> -					GFP_HIGHUSER_MOVABLE, NULL, 0);
> -		if (!page) {
> -			/*
> -			 * Either swap_duplicate() failed because entry
> -			 * has been freed independently, and will not be
> -			 * reused since sys_swapoff() already disabled
> -			 * allocation from here, or alloc_page() failed.
> -			 */
> -			swcount = *swap_map;
> -			/*
> -			 * We don't hold lock here, so the swap entry could be
> -			 * SWAP_MAP_BAD (when the cluster is discarding).
> -			 * Instead of fail out, We can just skip the swap
> -			 * entry because swapoff will wait for discarding
> -			 * finish anyway.
> -			 */
> -			if (!swcount || swcount == SWAP_MAP_BAD)
> -				continue;
> -			retval = -ENOMEM;
> -			break;
> -		}
> -
>  		/*
>  		 * Don't hold on to start_mm if it looks like exiting.
>  		 */
> @@ -1431,80 +1450,36 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  			atomic_inc(&init_mm.mm_users);
>  		}
>  
> -		/*
> -		 * Wait for and lock page.  When do_swap_page races with
> -		 * try_to_unuse, do_swap_page can handle the fault much
> -		 * faster than try_to_unuse can locate the entry.  This
> -		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> -		 * defer to do_swap_page in such a case - in some tests,
> -		 * do_swap_page and try_to_unuse repeatedly compete.
> -		 */
> -		wait_on_page_locked(page);
> -		wait_on_page_writeback(page);
> -		lock_page(page);
> -		wait_on_page_writeback(page);
> +		mm = list_entry(p, struct mm_struct, mmlist);
> +		if (!atomic_inc_not_zero(&mm->mm_users))
> +			continue;
> +		spin_unlock(&mmlist_lock);
> +
> +		cond_resched();
> +
> +		retval = unuse_mm(mm, type);
> +		mmput(mm);
> +		if (retval)
> +			break;
>  
>  		/*
> -		 * Remove all references to entry.
> +		 * Make sure that we aren't completely killing
> +		 * interactive performance.
>  		 */
> -		swcount = *swap_map;
> -		if (swap_count(swcount) == SWAP_MAP_SHMEM) {
> -			retval = shmem_unuse(entry, page);
> -			/* page has already been unlocked and released */
> -			if (retval < 0)
> +		cond_resched();
> +		/* TODO we need another way to count these,
> +		 * because we will now be unusing all an mm's pages
> +		 * on each pass through the loop
> +		 * Ignoring frontswap for now
> +		 */
> +		if (frontswap && pages_to_unuse > 0) {
> +			if (!--pages_to_unuse)
>  				break;
> -			continue;
>  		}
> -		if (swap_count(swcount) && start_mm != &init_mm)
> -			retval = unuse_mm(start_mm, entry, page);
> -
> -		if (swap_count(*swap_map)) {
> -			int set_start_mm = (*swap_map >= swcount);
> -			struct list_head *p = &start_mm->mmlist;
> -			struct mm_struct *new_start_mm = start_mm;
> -			struct mm_struct *prev_mm = start_mm;
> -			struct mm_struct *mm;
> -
> -			atomic_inc(&new_start_mm->mm_users);
> -			atomic_inc(&prev_mm->mm_users);
> -			spin_lock(&mmlist_lock);
> -			while (swap_count(*swap_map) && !retval &&
> -					(p = p->next) != &start_mm->mmlist) {
> -				mm = list_entry(p, struct mm_struct, mmlist);
> -				if (!atomic_inc_not_zero(&mm->mm_users))
> -					continue;
> -				spin_unlock(&mmlist_lock);
> -				mmput(prev_mm);
> -				prev_mm = mm;
>  
> -				cond_resched();
> -
> -				swcount = *swap_map;
> -				if (!swap_count(swcount)) /* any usage ? */
> -					;
> -				else if (mm == &init_mm)
> -					set_start_mm = 1;
> -				else
> -					retval = unuse_mm(mm, entry, page);
> -
> -				if (set_start_mm && *swap_map < swcount) {
> -					mmput(new_start_mm);
> -					atomic_inc(&mm->mm_users);
> -					new_start_mm = mm;
> -					set_start_mm = 0;
> -				}
> -				spin_lock(&mmlist_lock);
> -			}
> -			spin_unlock(&mmlist_lock);
> -			mmput(prev_mm);
> -			mmput(start_mm);
> -			start_mm = new_start_mm;
> -		}
> -		if (retval) {
> -			unlock_page(page);
> -			page_cache_release(page);
> -			break;
> -		}
> +		spin_lock(&mmlist_lock);
> +	}
> +	spin_unlock(&mmlist_lock);
>  
>  		/*
>  		 * If a reference remains (rare), we would like to leave
> @@ -1524,50 +1499,27 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  		 * this splitting happens to be just what is needed to
>  		 * handle where KSM pages have been swapped out: re-reading
>  		 * is unnecessarily slow, but we can fix that later on.
> +		 * TODO move this to unuse_pte_range?
>  		 */
> -		if (swap_count(*swap_map) &&
> -		     PageDirty(page) && PageSwapCache(page)) {
> -			struct writeback_control wbc = {
> -				.sync_mode = WB_SYNC_NONE,
> -			};
> -
> -			swap_writepage(page, &wbc);
> -			lock_page(page);
> -			wait_on_page_writeback(page);
> -		}
> -
> -		/*
> -		 * It is conceivable that a racing task removed this page from
> -		 * swap cache just before we acquired the page lock at the top,
> -		 * or while we dropped it in unuse_mm().  The page might even
> -		 * be back in swap cache on another swap area: that we must not
> -		 * delete, since it may not have been written out to swap yet.
> -		 */
> -		if (PageSwapCache(page) &&
> -		    likely(page_private(page) == entry.val))
> -			delete_from_swap_cache(page);
> -
> -		/*
> -		 * So we could skip searching mms once swap count went
> -		 * to 1, we did not mark any present ptes as dirty: must
> -		 * mark page dirty so shrink_page_list will preserve it.
> -		 */
> -		SetPageDirty(page);
> -		unlock_page(page);
> -		page_cache_release(page);
> -
> -		/*
> -		 * Make sure that we aren't completely killing
> -		 * interactive performance.
> -		 */
> -		cond_resched();
> -		if (frontswap && pages_to_unuse > 0) {
> -			if (!--pages_to_unuse)
> -				break;
> -		}
> -	}
> +/*		if (swap_count(*swap_map) &&
> +*		     PageDirty(page) && PageSwapCache(page)) {
> +*			struct writeback_control wbc = {
> +*				.sync_mode = WB_SYNC_NONE,
> +*			};
> +*
> +*			swap_writepage(page, &wbc);
> +*			lock_page(page);
> +*			wait_on_page_writeback(page);
> +*		}
> +*/
>  
> +	/* TODO check if there are any swap entries we failed to clean up. */
> +	if ((i = find_next_to_unuse(si, i, frontswap)) != 0)
> +		printk("swap entries remain, type not clean\n");
> +	printk("Leaving try_to_unuse\n");
> +	printk("Calls made to unuse_pte: %lu\n", unusepte_calls);
>  	mmput(start_mm);
> +out:
>  	return retval;
>  }
>  
> 


-- 
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/