linux-kernel - Re: [PATCH 00/28] Swap over NFS -v16

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1205140674.8514.152.camel@twins>
Date:	Mon, 10 Mar 2008 10:17:54 +0100
From:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
To:	Neil Brown <neilb@...e.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	netdev@...r.kernel.org, trond.myklebust@....uio.no,
	Pekka Enberg <penberg@...helsinki.fi>
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Mon, 2008-03-10 at 16:15 +1100, Neil Brown wrote:

> > On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > > 
> > > [I don't find the above wholly satisfying.  There seems to be too much
> > >  hand-waving.  If someone can provide better text explaining why
> > >  swapout is a special case, that would be great.]
> > 
> > Anonymous pages are dirty by definition (except the zero page, but I
> > think we recently ditched it). So shrinking of the anonymous pool will
> > require swapping.
> 
> Well, there is the swap cache.  That's probably what I was thinking of
> when I said "clean anonymous pages".  I suspect they are the first to
> go!

Ah, right, we could consider those clean anonymous. Alas, they are just
part of the aging lists and do not get special priority.

> > It is indeed the last refuge for those with GFP_NOFS. Allong with the
> > strict limit on the amount of dirty file pages it also ensures writing
> > those out will never deadlock the machine as there are always clean file
> > pages and or anonymous pages to launder.
> 
> The difficulty I have is justifying exactly why page-cache writeout
> will not deadlock.  What if all the memory that is not dirty-pagecache
> is anonymous, and if swap isn't enabled?

Ah, I never considered the !SWAP case.

> Maybe the number returned by "determine_dirtyable_memory" in
> page-writeback.c excludes anonymous pages?  I wonder if the meaning of
> NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

I don't think they are, but it should be obvious once you know the VM,
har har har :-)

NR_FREE_PAGES are the pages in the page allocators free lists.
NR_INACTIVE are the pages on the inactive list
NR_ACTIVE are the pageso on the active list

NR_INACTIVE+NR_ACTIVE are the number of pages on the page reclaim lists.

So, if you consider !SWAP, we could get in a deadlock when all of memory
is anonymous except for a few (<=dirty limit) dirty file pages.

But I guess the !SWAP people know what they're doing, large anon usage
without swap is asking for trouble.
 
> > Right. I've had a long conversation on PG_emergency with Pekka. And I
> > think the conclusion was that PG_emergency will create more head-aches
> > than it solves. I probably have the conversation in my IRC logs and
> > could email it if you're interested (and Pekka doesn't object).
> 
> Maybe that depends on the exact semantic of PG_emergency ??
> I remember you being concerned that PG_emergency never changes between
> allocation and freeing, and that wouldn't work well with slub.
> My envisioned semantic has it possibly changing quite often.
> What it means is:
>    The last allocation done from this page was in a low-memory
>    condition.

Yes, that works, except that we'd need to iterate all pages and clear
PG_emergency - which would imply tracking all these pages etc..

Hence it would be better not to keep persistent state and do as we do
now; use some non-persistent state on allocation.

> You really need some way to tell if the result of kmalloc/kmemalloc
> should be treated as reserved.
> I think you had code which first tried the allocation without
> GFP_MEMALLOC and then if that failed, tried again *with*
> GFP_MEMALLOC.  If that then succeeded, it is assumed to be an
> allocation from reserves.  That seemed rather ugly, though I guess you
> could wrap it in a function to hide the ugliness:
> 
> void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
> {
> 	void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 0;
> 		return result;
> 	}
> 	result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 1;
> 		return result;
> 	}
> 	return NULL;
> }
> ???

Yeah, I this this is the best we can do, just split this part out into
helper functions. I've been thinking of doing this - just haven't gotten
around to implementing it. Hope to do so this week and send out a new
series.

> > I've already heard interest from other people to use these hooks to
> > provide swap on other non-block filesystems such as jffs2, logfs and the
> > like.
> 
> I'm interested in the swap_in/swap_out interface for external
> write-intent bitmaps for md/raid arrays.
> You can have a write-intent bitmap which records which blocks might be
> dirty if the host crashes, so that resync is much faster.
> It can be stored in a file in a separate filesystem, but that is
> currently implemented by using bmap to enumerate the blocks and then
> reading/writing directly to the device (like swap).  Your interface
> would be much nicer for that (not that I think having a
> write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

Hmm, right. But for that purpose the names swap_* are a tad misleading.
I remember hch mentioning this at some point. What would be a more
suitable naming scheme so we can both use it?

> I'll look forward to your next patch set....
> 
> One thing I had thought odd while reading the patches, but haven't
> found an opportunity to mention before, is the "IS_SWAPFILE" test in
> nfs-swapper.patch.
> This seems like a layering violation.  It would be better if the test
> was based on whether  ->swapfile had been called on the file.  That way
> my write-intent-bitmaps would get the same benefit.

I'll look into this, I didn't thing using a inode test inside a
filesystem implementation was too weird..

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/