linux-kernel - Re: [patch][v2] swap: virtual swap readahead

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0906071747440.20105@sister.anvils>
Date:	Sun, 7 Jun 2009 18:55:15 +0100 (BST)
From:	Hugh Dickins <hugh.dickins@...cali.co.uk>
To:	Johannes Weiner <hannes@...xchg.org>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [patch][v2] swap: virtual swap readahead

On Wed, 3 Jun 2009, Johannes Weiner wrote:
> Hi Andrew,
> 
> I redid the qsbench runs with a bigger page cluster (2^4).  It shows
> improvement on both versions, the patched one still performing better.
> Rik hinted me that we can make the default even bigger when we are
> better at avoiding reading unrelated pages.  I am currently testing
> this.  Here are the timings for 2^4 (i.e. twice the) ra pages:
> 
> vanilla:
> 1 x 2048M [20 runs]  user 101.41/101.06 [1.42] system 11.02/10.83 [0.92] real 368.44/361.31 [48.47]
> 2 x 1024M [20 runs]  user 101.42/101.23 [0.66] system 12.98/13.01 [0.56] real 338.45/338.56 [2.94]
> 4 x 540M [20 runs]  user 101.75/101.62 [1.03] system 10.05/9.52 [1.53] real 371.97/351.88 [77.69]
> 8 x 280M [20 runs]  user 103.35/103.33 [0.63] system 9.80/9.59 [1.72] real 453.48/473.21 [115.61]
> 16 x 128M [20 runs]  user 91.04/91.00 [0.86] system 8.95/9.41 [2.06] real 312.16/342.29 [100.53]
> 
> vswapra:
> 1 x 2048M [20 runs]  user 98.47/98.32 [1.33] system 9.85/9.90 [0.92] real 373.95/382.64 [26.77]
> 2 x 1024M [20 runs]  user 96.89/97.00 [0.44] system 9.52/9.48 [1.49] real 288.43/281.55 [53.12]
> 4 x 540M [20 runs]  user 98.74/98.70 [0.92] system 7.62/7.83 [1.25] real 291.15/296.94 [54.85]
> 8 x 280M [20 runs]  user 100.68/100.59 [0.53] system 7.59/7.62 [0.41] real 305.12/311.29 [26.09]
> 16 x 128M [20 runs]  user 88.67/88.50 [1.02] system 6.06/6.22 [0.72] real 205.29/221.65 [42.06]
> 
> Furthermore I changed the patch to leave shmem alone for now and added
> documentation for the new approach.  And I adjusted the changelog a
> bit.
> 
> Andi, I think the NUMA policy is already taken care of.  Can you have
> another look at it?  Other than that you gave positive feedback - can
> I add your acked-by?
> 
> 	Hannes
> 
> ---
> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
> 
> This works as long as the physical swap slot order approximates the
> LRU order decently, otherwise it wastes memory and IO bandwidth to
> read in pages that are unlikely to be needed soon.
> 
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. high memory pressure situations,
> and this is exactly the situation where swapin should not waste any
> memory or IO bandwidth as both are the most contended resources at
> this point.
> 
> Another approximation for LRU-relation is the VMA order as groups of
> VMA-related pages are usually used together.
> 
> This patch combines both the physical and the virtual hint to get a
> good approximation of pages that are sensible to read ahead.
> 
> When both diverge, we either read unrelated data, seek heavily for
> related data, or, what this patch does, just decrease the readahead
> efforts.
> 
> To achieve this, we have essentially two readahead windows of the same
> size: one spans the virtual, the other one the physical neighborhood
> of the faulting page.  We only read where both areas overlap.
> 
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> Reviewed-by: Rik van Riel <riel@...hat.com>
> Cc: Hugh Dickins <hugh.dickins@...cali.co.uk>
> Cc: Andi Kleen <andi@...stfloor.org>

I think this a great idea, a very promising approach.  I like it
so much better than Andrew's and others' proposals to dedicate
areas of swap space to distinct virtual objects: which, as you
rightly pointed out, condemn us to unnecessary seeking when
writing swap (and is even more of an issue if we're writing
to SSD rather than HDD).

It would be nice to get results from a wider set of benchmarks
than just qsbench; but I don't think qsbench is biased in your
favour, and I don't think you need go to too much trouble on
that - let's just aim to get your work into mmotm, then we can
all play around with it for a while.  I suppose what I'd most
like is to try it with shmem, which you've understandably left
out for now.

You'll be hating me for the way I made shmem_truncate_range() etc.
nigh incomprehensible when enabling highmem index pages there.
Christoph Rohland's original was much nicer.  Again and again and
again I've wanted to throw all that out, and keep swap entries in
the standard radix tree which keeps the struct page pointers; but
again and again and again, I've been unable to justify losing the
highmem index ability - for a while it seemed as if x86_64 was
going to make highmem a thing of the past, and it's certainly helped
us to ignore 32GB 32-bit; but I think 4GB 32-bit is here a long while.

Though I like the way you localized it all into swapin_readahead(),
I'd prefer to keep ptes out of swap_state.c, and think several issues
will fall away if you turn your patch around.  You'll avoid the pte code
in swap_state.c, you'll satisfy Andi's concerns about locking/kmapping
overhead, and you'll find shmem much much easier, if instead of peering
back at where you've come from in swapin_readahead(), you make the outer
levels (do_swap_page and shmem_getpage) pass a vector of swap entries to
swapin_readahead()?  That vector on stack, and copied from the one page
table or index page (don't bother to cross page table or index page
boundaries) while it was mapped.

It's probably irrelevant to you, but I've attached an untested patch
of mine which stomps somewhat on this area: I was shocked to notice
shmem_getpage() in a list of deep stack offenders a few months back,
realized it was those unpleasant NUMA mempolicy pseudo-vmas, and made
a patch to get rid of them.  I've rebased it to 2.6.30-rc8 and checked
that the resulting kernel runs, but not really tested it; and I think
I didn't even try to get the mpol reference counting right (tends to
be an issue precisely in swapin_readahead, where one mpol ends up used
repeatedly) - mpol refcounting is an arcane art only Lee understands!
I've attached the patch because you may want to glance at it and
decide, either that it's something which is helpful to the direction
you're going in and you'd like to base upon it, or that it's a
distraction and you'd prefer me to keep it to myself until your
changes are in.

But your patch below is incomplete, isn't it?  The old swapin_readahead()
is now called swapin_readahead_phys(), and you want shmem_getpage() to be
using that for now: but no prototype for it and no change to mm/shmem.c.

Hugh

> ---
>  mm/swap_state.c |  115 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 99 insertions(+), 16 deletions(-)
> 
> version 2:
>   o fall back to physical ra window for shmem
>   o add documentation to the new ra algorithm
> 
> qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":
> 
> 		vanilla				vswapra
> 
> 1  x 2048M	391.25 ( 71.76) 384.56		445.55 (83.19) 415.41
> 2  x 1024M	384.25 ( 75.00) 423.08		290.26 (31.38) 299.51
> 4  x  540M	553.91 (100.02) 554.57		336.58 (52.49) 331.52
> 8  x  280M	561.08 ( 82.36) 583.12		319.13 (43.17) 307.69
> 16 x  128M	285.51 (113.20) 236.62		214.24 (62.37) 214.15
> 
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
>  	return found_page;
>  }
>  
> -/**
> - * swapin_readahead - swap in pages in hope we need them soon
> - * @entry: swap entry of this memory
> - * @gfp_mask: memory allocation flags
> - * @vma: user vma this address belongs to
> - * @addr: target address for mempolicy
> - *
> - * Returns the struct page for entry and addr, after queueing swapin.
> - *
> +/*
>   * Primitive swap readahead code. We simply read an aligned block of
>   * (1 << page_cluster) entries in the swap area. This method is chosen
>   * because it doesn't cost us any seek time.  We also make sure to queue
>   * the 'original' request together with the readahead ones...
> - *
> - * This has been extended to use the NUMA policies from the mm triggering
> - * the readahead.
> - *
> - * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
>   */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -			struct vm_area_struct *vma, unsigned long addr)
> +static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
> +				struct vm_area_struct *vma, unsigned long addr)
>  {
>  	int nr_pages;
>  	struct page *page;
> @@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> +
> +/**
> + * swapin_readahead - swap in pages in hope we need them soon
> + * @entry: swap entry of this memory
> + * @gfp_mask: memory allocation flags
> + * @vma: user vma this address belongs to
> + * @addr: target address for mempolicy
> + *
> + * Returns the struct page for entry and addr, after queueing swapin.
> + *
> + * The readahead window is the virtual area around the faulting page,
> + * where the physical proximity of the swap slots is taken into
> + * account as well.
> + *
> + * While the swap allocation algorithm tries to keep LRU-related pages
> + * together on the swap backing, it is not reliable on heavy thrashing
> + * systems where concurrent reclaimers allocate swap slots and/or most
> + * anonymous memory pages are already in swap cache.
> + *
> + * On the virtual side, subgroups of VMA-related pages are usually
> + * used together, which gives another hint to LRU relationship.
> + *
> + * By taking both aspects into account, we get a good approximation of
> + * which pages are sensible to read together with the faulting one.
> + *
> + * This has been extended to use the NUMA policies from the mm
> + * triggering the readahead.
> + *
> + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> + */
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +			struct vm_area_struct *vma, unsigned long addr)
> +{
> +	unsigned long start, pos, end;
> +	unsigned long pmin, pmax;
> +	int cluster, window;
> +
> +	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
> +		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> +	cluster = 1 << page_cluster;
> +	window = cluster << PAGE_SHIFT;
> +
> +	/* Physical range to read from */
> +	pmin = swp_offset(entry) & ~(cluster - 1);
> +	pmax = pmin + cluster;
> +
> +	/* Virtual range to read from */
> +	start = addr & ~(window - 1);
> +	end = start + window;
> +
> +	for (pos = start; pos < end; pos += PAGE_SIZE) {
> +		struct page *page;
> +		swp_entry_t swp;
> +		spinlock_t *ptl;
> +		pgd_t *pgd;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, pos);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		pud = pud_offset(pgd, pos);
> +		if (!pud_present(*pud))
> +			continue;
> +		pmd = pmd_offset(pud, pos);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> +		if (!is_swap_pte(*pte)) {
> +			pte_unmap_unlock(pte, ptl);
> +			continue;
> +		}
> +		swp = pte_to_swp_entry(*pte);
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> +		if (!page)
> +			continue;
> +		page_cache_release(page);
> +	}
> +	lru_add_drain();	/* Push any new pages onto the LRU now */
> +	return read_swap_cache_async(entry, gfp_mask, vma, addr);
> +}
View attachment "PATCH" of type "TEXT/PLAIN" (13630 bytes)