Date:	Wed, 27 Mar 2013 16:16:48 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Dan Magenheimer <dan.magenheimer@...cle.com>
cc:	Minchan Kim <minchan@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Seth Jennings <sjenning@...ux.vnet.ibm.com>,
	Nitin Gupta <ngupta@...are.org>,
	Konrad Rzeszutek Wilk <konrad@...nok.org>,
	Shaohua Li <shli@...nel.org>,
	Kamezawa Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Bob Liu <bob.liu@...cle.com>
Subject: RE: [RFC] mm: remove swapcache page early

On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > From: Hugh Dickins [mailto:hughd@...gle.com]
> > Subject: Re: [RFC] mm: remove swapcache page early
> > 
> > On Wed, 27 Mar 2013, Minchan Kim wrote:
> > 
> > > Swap subsystem does lazy swap slot free with expecting the page
> > > would be swapped out again so we can't avoid unnecessary write.
> >                              so we can avoid unnecessary write.
> > >
> > > But the problem with in-memory swap is that it consumes memory space
> > > until the vm_swap_full() condition (ie, half of all swap devices used)
> > > is met. It could be bad if we use multiple swap devices (a small
> > > in-memory swap and a big storage swap), or in-memory swap alone.
> > 
> > That is a very good realization: it's surprising that none of us
> > thought of it before - no disrespect to you, well done, thank you.
> 
> Yes, my compliments also, Minchan.  This problem has been thought of before,
> but this patch is the first to identify a possible solution.
>  
> > And I guess swap readahead is utterly unhelpful in this case too.
> 
> Yes... as is any "swap writeahead".  Excuse my ignorance, but I
> think this is not done in the swap subsystem but instead the kernel
> assumes write-coalescing will be done in the block I/O subsystem,
> which means swap writeahead would affect zram but not zcache/zswap
> (since frontswap subverts the block I/O subsystem).

I don't know what swap writeahead is; but write coalescing, yes.
I don't see any problem with it in this context.

> 
> However I think a swap-readahead solution would be helpful to
> zram as well as zcache/zswap.

Whereas swap readahead on zmem means uncompressing zmem into pagecache
that may never be needed, and that may take a circuit of the inactive
LRU before it gets reclaimed (if it turns out not to be needed;
at least it will remain clean and be easily reclaimed).
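
To make that concrete, here is a rough, self-contained model of the
effect (illustrative only, nothing like the kernel's actual
swapin_readahead(); every name in it is simplified):

/*
 * Simplified model (not kernel code): swap readahead pulls a window of
 * slots around the faulting entry into pagecache.  With a zmem backend,
 * every speculative slot costs a decompression, and the resulting clean
 * page sits on the inactive LRU until it is referenced or reclaimed.
 */
#include <stdbool.h>
#include <stdio.h>

#define READAHEAD_WINDOW 8

struct slot { bool present; bool in_zmem; };

static void decompress_to_pagecache(unsigned long offset)
{
	/* stand-in for the CPU cost of inflating a compressed zmem page */
	printf("decompressed slot %lu into pagecache (may never be used)\n", offset);
}

static void read_from_disk(unsigned long offset)
{
	printf("queued disk read for slot %lu\n", offset);
}

/* Fault on 'target'; speculatively bring in its neighbours as well. */
static void swapin_readahead_model(struct slot *slots, unsigned long nslots,
				   unsigned long target)
{
	unsigned long start = target - (target % READAHEAD_WINDOW);

	for (unsigned long off = start;
	     off < start + READAHEAD_WINDOW && off < nslots; off++) {
		if (!slots[off].present)
			continue;
		if (slots[off].in_zmem)
			decompress_to_pagecache(off); /* memory + CPU spent up front */
		else
			read_from_disk(off);          /* cheap to coalesce, no memory dup */
	}
}

int main(void)
{
	struct slot slots[16] = { [3] = {true, true}, [4] = {true, false},
				  [5] = {true, true} };

	swapin_readahead_model(slots, 16, 4);
	return 0;
}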

> 
> > > This patch changes the vm_swap_full logic slightly so it can free
> > > the swap slot early if the backing device is really fast.
> > > For that, I used SWP_SOLIDSTATE, but it might be controversial.
> > 
> > But I strongly disagree with almost everything in your patch :)
> > I disagree with addressing it in vm_swap_full(), I disagree that
> > it can be addressed by device, I disagree that it has anything to
> > do with SWP_SOLIDSTATE.
> > 
> > This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> > is it?  In those cases, a fixed amount of memory has been set aside
> > for swap, and it works out just like with disk block devices.  The
> > memory set aside may be wasted, but that is accepted upfront.
> 
> It is (I believe) also a problem with swapping to ram.  Two
> copies of the same page are kept in memory in different places,
> right?  Fixed vs variable size is irrelevant I think.  Or am
> I misunderstanding something about swap-to-ram?

I may be misremembering how /dev/ram0 works, or simply assuming that
if you want to use it for swap (interesting for testing, but probably
not for general use), then you make sure to allocate each page of it
in advance.

The pages of /dev/ram0 don't get freed, or not before it's closed
(swapoff'ed) anyway.  Yes, swapcache would be duplicating data from
other memory into /dev/ram0 memory; but that /dev/ram0 memory has
been set aside for this purpose, and removing from swapcache won't
free any more memory.

> 
> > Similarly, this is not a problem with swapping to SSD.  There might
> > or might not be other reasons for adjusting the vm_swap_full() logic
> > for SSD or generally, but those have nothing to do with this issue.
> 
> I think it is at least highly related.  The key issue is the
> tradeoff of the likelihood that the page will soon be read/written
> again while it is in swap cache vs the time/resource-usage necessary
> to "reconstitute" the page into swap cache.  Reconstituting from disk
> requires a LOT of elapsed time.  Reconstituting from
> an SSD likely takes much less time.  Reconstituting from
> zcache/zram takes thousands of CPU cycles.

I acknowledge my complete ignorance of how to judge the tradeoff
between memory usage and cpu usage, but I think Minchan's main
concern was with the memory usage.  Neither hard disk nor SSD
is occupying memory.
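
For reference, the heuristic being argued about is roughly "more than
half of all swap is in use"; below is a toy model of that check, of the
RFC's early-free variant, and of the zmem-based distinction I keep
coming back to (simplified, none of it is the actual kernel code):

/*
 * Toy model of the heuristic under discussion.  Today the check is
 * global: start dropping swapcache duplicates once more than half of
 * all swap is used.  The RFC also frees the slot early when the backing
 * device is "fast" (SWP_SOLIDSTATE); the distinction argued for here is
 * that what matters is whether the slot lives in zmem, not whether the
 * device happens to be solid-state.
 */
#include <stdbool.h>
#include <stdio.h>

struct swap_model {
	long nr_swap_pages;	/* free swap slots remaining */
	long total_swap_pages;	/* size of all swap together */
	bool backing_is_fast;	/* stand-in for SWP_SOLIDSTATE */
	bool slot_is_in_zmem;	/* the copy that actually occupies memory */
};

/* Current global heuristic: "swap is more than half full". */
static bool vm_swap_full_model(const struct swap_model *s)
{
	return s->nr_swap_pages * 2 < s->total_swap_pages;
}

/* The RFC's variant: also free early whenever the device is fast. */
static bool free_slot_early_rfc(const struct swap_model *s)
{
	return s->backing_is_fast || vm_swap_full_model(s);
}

/* The frontswap-based distinction: key off the zmem copy, not the device. */
static bool free_slot_early_zmem(const struct swap_model *s)
{
	return s->slot_is_in_zmem || vm_swap_full_model(s);
}

int main(void)
{
	/* plenty of swap free, SSD-backed, but this slot is not in zmem */
	struct swap_model s = { 1000, 4000, true, false };

	printf("rfc frees early: %d, zmem-based frees early: %d\n",
	       free_slot_early_rfc(&s), free_slot_early_zmem(&s));
	return 0;
}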

> 
> > The problem here is peculiar to frontswap, and the variably sized
> > memory behind it, isn't it?  We are accustomed to using swap to free
> > up memory by transferring its data to some other, cheaper but slower
> > resource.
> 
> Frontswap does make the problem more complex because some pages
> are in "fairly fast" storage (zcache, needs decompression) and
> some are on the actual (usually) rotating media.  Fortunately,
> differentiating between these two cases is just a table lookup
> (see frontswap_test).
> 
> > But in the case of frontswap and zmem (I'll say that to avoid thinking
> > through which backends are actually involved), it is not a cheaper and
> > slower resource, but the very same memory we are trying to save: swap
> > is stolen from the memory under reclaim, so any duplication becomes
> > counter-productive (if we ignore cpu compression/decompression costs:
> > I have no idea how fair it is to do so, but anyone who chooses zmem
> > is prepared to pay some cpu price for that).
> 
> Exactly.  There is some "robbing of Peter to pay Paul" and
> other complex resource tradeoffs.  Presumably, though, it is
> not "the very same memory we are trying to save" but a
> fraction of it, saving the same page of data more efficiently
> in memory, using less than a page, at some CPU cost.

Yes, I'm not saying that frontswap/zmem is pointless: just agreeing
with Minchan that in this case the duplication inherent in swapcache
can be a waste of memory that we should try to avoid.

> 
> > And because it's a frontswap thing, we cannot decide this by device:
> > frontswap may or may not stand in front of each device.  There is no
> > problem with swapcache duplicated on disk (until that area approaches
> > being full or fragmented), but at the higher level we cannot see what
> > is in zmem and what is on disk: we only want to free up the zmem dup.
> 
> I *think* frontswap_test(page) resolves this problem, as long as
> we have a specific page available to use as a parameter.
> 
> > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > copy of the page (to free up the compressed memory when possible) and
> > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > (setting page dirty so nothing will later go to read it from the
> > unfreed location on backing swap disk, which was never written).
> 
> There are two duplication issues:  (1) When can the page be removed
> from the swap cache after a call to frontswap_store; and (2) When
> can the page be removed from the frontswap storage after it
> has been brought back into memory via frontswap_load.
> 
> This patch from Minchan addresses (1).

Ying Han was reminding me of this case a couple of hours ago, and we don't
see a problem there: when frontswap_store() succeeds, there's an
end_page_writeback() as there should be, and shrink_page_list()
should reclaim the page immediately.  So I think (1) is already
handled and Minchan was not trying to address it.
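
Roughly, the store-side flow already goes like this (a simplified,
self-contained model, not the real swap_writepage()/shrink_page_list()
code):

/*
 * Once the frontswap store succeeds and writeback is ended, reclaim can
 * drop the clean swapcache page right away, so issue (1) needs no extra
 * work.  Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

struct page_model {
	bool dirty;
	bool under_writeback;
	bool in_swapcache;
};

/* Compress and stash the page in zmem; may fail if zmem is full. */
static bool frontswap_store_model(struct page_model *page)
{
	(void)page;
	return true;
}

static void end_page_writeback_model(struct page_model *page)
{
	page->under_writeback = false;
}

/* Writeout: on a successful zmem store, writeback ends immediately. */
static void swap_writepage_model(struct page_model *page)
{
	page->dirty = false;
	page->under_writeback = true;
	if (frontswap_store_model(page))
		end_page_writeback_model(page);
	/* else: fall back to real block I/O against the swap device */
}

/* Reclaim: a clean page no longer under writeback can be freed. */
static bool shrink_page_model(struct page_model *page)
{
	if (!page->dirty && !page->under_writeback && page->in_swapcache) {
		page->in_swapcache = false;	/* drop from swapcache, free it */
		return true;
	}
	return false;
}

int main(void)
{
	struct page_model page = { .dirty = true, .in_swapcache = true };

	swap_writepage_model(&page);
	printf("reclaimed immediately: %d\n", shrink_page_model(&page));
	return 0;
}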

> The issue you are raising
> here is (2).  You may not know that (2) has recently been solved
> in frontswap, at least for zcache.  See frontswap_exclusive_gets_enabled.
> If this is enabled (and it is for zcache but not yet for zswap),
> what you suggest (SetPageDirty) is what happens.

Ah, and I have a dim, perhaps mistaken, memory that I gave you
input on that before, suggesting the SetPageDirty.  Good, sounds
like the solution is already in place, if not actually activated.
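
And the load-side behaviour we're talking about, modelled very roughly
(frontswap_exclusive_gets_enabled is the real knob; the other names
here are simplified stand-ins):

/*
 * When the zmem copy is handed back exclusively, it is invalidated and
 * the swapcache page is marked dirty, so nothing will later try to read
 * it from the never-written location on the backing swap disk.
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct zmem_slot {
	void *compressed;	/* NULL once invalidated */
};

struct page_model {
	bool dirty;
	bool uptodate;
};

static bool exclusive_gets_enabled = true;

/* Swapin: decompress into the page, optionally freeing the zmem copy. */
static void frontswap_load_model(struct zmem_slot *slot, struct page_model *page)
{
	/* ... decompress slot->compressed into the page here ... */
	page->uptodate = true;

	if (exclusive_gets_enabled) {
		free(slot->compressed);	/* invalidate: hand the memory back */
		slot->compressed = NULL;
		page->dirty = true;	/* SetPageDirty: disk copy was never written */
	}
}

int main(void)
{
	struct zmem_slot slot = { .compressed = malloc(64) };
	struct page_model page = { 0 };

	frontswap_load_model(&slot, &page);
	printf("zmem copy freed: %d, page dirty: %d\n",
	       slot.compressed == NULL, page.dirty);
	return 0;
}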

Thanks, must dash,
Hugh

> 
> > We cannot rely on freeing the swap itself, because in general there
> > may be multiple references to the swap, and we only satisfy the one
> > which has faulted.  It may or may not be a good idea to use rmap to
> > locate the other places to insert pte in place of swap entry, to
> > resolve them all at once; but we have chosen not to do so in the
> > past, and there's no need for that, if the zmem gets invalidated
> > and the swapcache page set dirty.
> 
> I see.  Minchan's patch handles the removal "reactively"... it
> might be possible to handle it more proactively.  Or it may
> be possible to take the number of references into account when
> deciding whether to frontswap_store the page as, presumably,
> the likelihood of needing to "reconstitute" the page sooner increases
> with each additional reference.
> 
> > Hugh
> 
> Very useful thoughts, Hugh.  Thanks much and looking forward
> to more discussion at LSF/MM!
> 
> Dan
> 
> P.S. When I refer to zcache, I am referring to the version in
> drivers/staging/zcache in 3.9.  The code in drivers/staging/zcache
> in 3.8 is "old zcache"... "new zcache" is in drivers/staging/ramster
> in 3.8.  Sorry for any confusion...