linux-kernel - Re: [RFC v2] Support volatile range for anon vma

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20121031143524.0509665d.akpm@linux-foundation.org>
Date:	Wed, 31 Oct 2012 14:35:24 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Minchan Kim <minchan@...nel.org>
Cc:	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	John Stultz <john.stultz@...aro.org>,
	Christoph Lameter <cl@...ux.com>,
	Android Kernel Team <kernel-team@...roid.com>,
	Robert Love <rlove@...gle.com>, Mel Gorman <mel@....ul.ie>,
	Hugh Dickins <hughd@...gle.com>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Chinner <david@...morbit.com>, Neil Brown <neilb@...e.de>,
	Mike Hommey <mh@...ndium.org>, Taras Glek <tglek@...illa.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	sanjay@...gle.com, pjt@...gle.com,
	David Rientjes <rientjes@...gle.com>
Subject: Re: [RFC v2] Support volatile range for anon vma

On Tue, 30 Oct 2012 10:29:54 +0900
Minchan Kim <minchan@...nel.org> wrote:

> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
> 
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.

Presumably the userspace allocator will internally manage memory in
large chunks, so the munmap() call frequency will be much lower than
the free() call frequency.  So the performance gains from this change
might be very small.

The whole point of the patch is to improve performance, but we have no
evidence that it was successful in doing that!  I do think we'll need
good quantitative testing results before proceeding with such a patch,
please.

Also, it is very desirable that we involve the relevant userspace
(glibc, etc) developers in this.  And I understand that the google
tcmalloc project will probably have interest in this - I've cc'ed
various people@...gle in the hope that they can provide input (please).

Also, it is a userspace API change.  Please cc mtk.manpages@...il.com.

Also, I assume that you have userspace test code.  At some stage,
please consider adding a case to tools/testing/selftests.  Such a test
would require to creation of memory pressure, which is rather contrary
to the selftests' current philosopy of being a bunch of short-running
little tests.  Perhaps you can come up with something.  But I suggest
that such work be done later, once it becomes clearer that this code is
actually headed into the kernel.

> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.

Well, why?  It would be easy enough for the fault handler to give
userspace a new, zeroed page at that address.

Or we could simply leave the old page in place at that address.  If the
page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
page (or all the not-yet-reclaimed pages) back to userspace at their
old addresses.

Various options suggest themselves here.  You've chosen one of them but
I would like to see a pretty exhaustive description of the reasoning
behind that decision.

Also, I wonder about the interaction with other vma manipulation
operations.  For example, can a VMA get split when in the MADV_VOLATILE
state?  If so, what happens?  

Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
but I bet this wasn't tested ;)

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_VOLATILE:
> +		if (vma->vm_flags & VM_LOCKED) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= VM_VOLATILE;
> +		vma->purged = false;
> +		break;
> +	case MADV_NOVOLATILE:
> +		if (!(vma->vm_flags & VM_VOLATILE)) {
> +			error = -EINVAL;
> +			goto out;

I wonder if this really should return an error.  Other madvise()
options don't do this, and running MADV_NOVOLATILE against a
not-volatile area seems pretty benign and has clearly defined before-
and after- states.

> +		}
> +
> +		new_flags &= ~VM_VOLATILE;
> +		break;
>  	}
>  
>  	if (new_flags == vma->vm_flags) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/