linux-kernel - Re: [RFC v2] Support volatile range for anon vma

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPM31RKm89s6PaAnfySUD-f+eGdoZP6=9DHy58tx_4Zi8Z9WPQ@mail.gmail.com>
Date:	Wed, 31 Oct 2012 14:59:07 -0700
From:	Paul Turner <pjt@...gle.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Minchan Kim <minchan@...nel.org>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, John Stultz <john.stultz@...aro.org>,
	Christoph Lameter <cl@...ux.com>,
	Android Kernel Team <kernel-team@...roid.com>,
	Robert Love <rlove@...gle.com>, Mel Gorman <mel@....ul.ie>,
	Hugh Dickins <hughd@...gle.com>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Chinner <david@...morbit.com>, Neil Brown <neilb@...e.de>,
	Mike Hommey <mh@...ndium.org>, Taras Glek <tglek@...illa.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	sanjay@...gle.com, David Rientjes <rientjes@...gle.com>
Subject: Re: [RFC v2] Support volatile range for anon vma

On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
<akpm@...ux-foundation.org> wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <minchan@...nel.org> wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux.    For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also from a performance stand-point I doubt any allocator (which
case about performance) is going to want to pay the cost of even a
null syscall about typical malloc/free usage (consider: a tcmalloc
malloc/free pairis currently <20ns).  Given then that this cost is
amortized once you start doing discards on larger blocks MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate
since there's no munmap/mmap for the region to change about
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space.  Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@...gle in the hope that they can provide input (please).
>
> Also, it is a userspace API change.  Please cc mtk.manpages@...il.com.
>
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >               if (error)
> >                       goto out;
> >               break;
> > +     case MADV_VOLATILE:
> > +             if (vma->vm_flags & VM_LOCKED) {
> > +                     error = -EINVAL;
> > +                     goto out;
> > +             }
> > +             new_flags |= VM_VOLATILE;
> > +             vma->purged = false;
> > +             break;
> > +     case MADV_NOVOLATILE:
> > +             if (!(vma->vm_flags & VM_VOLATILE)) {
> > +                     error = -EINVAL;
> > +                     goto out;
>
> I wonder if this really should return an error.  Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > +             }
> > +
> > +             new_flags &= ~VM_VOLATILE;
> > +             break;
> >       }
> >
> >       if (new_flags == vma->vm_flags) {
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/