Date:	Wed, 13 Mar 2013 15:44:59 +0900
From:	Minchan Kim <minchan.kim@....com>
To:	Paul Turner <pjt@...gle.com>
Cc:	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Michael Kerrisk <mtk.manpages@...il.com>,
	Arun Sharma <asharma@...com>,
	John Stultz <john.stultz@...aro.org>,
	Mel Gorman <mel@....ul.ie>, Hugh Dickins <hughd@...gle.com>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Rik van Riel <riel@...hat.com>, Neil Brown <neilb@...e.de>,
	Mike Hommey <mh@...ndium.org>, Taras Glek <tglek@...illa.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Jason Evans <je@...com>, Sanjay Ghemawat <sanjay@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michel Lespinasse <walken@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 04:16:57PM -0700, Paul Turner wrote:
> On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <minchan@...nel.org> wrote:
> > First of all, let's define the term.
> > From now on, I'd like to call it vrange (a.k.a. volatile range)
> > for anonymous pages. If you have a better name in mind, please suggest it.
> >
> > This version is still *RFC* because it's just a quick prototype, so
> > it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86.
> > Before sorting out further issues, I'd like to post the current direction
> > and discuss it. Of course, I'd like to extend this discussion at the
> > upcoming LSF/MM.
> >
> > In this version, I changed lots of things. In particular, I removed the
> > vma-based approach because it needs the write-side lock for mmap_sem, which
> > will drop performance on multi-threaded big SMP systems, as KOSAKI pointed
> > out. The vma-based approach also makes it hard to meet the requirements of
> > the new system call semantics John Stultz suggested for consistent purged
> > handling.
> > (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> >
> > I tested this patchset with a modified jemalloc allocator. The port was
> > led by Jason Evans (jemalloc author), who was interested in this feature
> > and was happy to port his allocator to use the new system call.
> > Super thanks, Jason!
> >
> > The benchmark used for the test is ebizzy. It has been used for testing
> > allocator performance, so it suits this purpose. Again, thanks for
> > recommending the benchmark, Jason.
> > (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> >
> > The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G):
> >
> >         ebizzy -S 20
> >
> > jemalloc-vanilla: 52389 records/sec
> > jemalloc-vrange: 203414 records/sec
> >
> >         ebizzy -S 20 with background memory pressure
> >
> > jemalloc-vanilla: 40746 records/sec
> > jemalloc-vrange: 174910 records/sec
> >
> > It is also much improved on a KVM virtual machine.
> >
> > This patchset is based on v3.9-rc2
> >
> > - What is sys_vrange(addr, length, mode, behavior)?
> >
> >   It's a hint that the user delivers to the kernel so the kernel can
> >   *discard* pages in a range at any time. mode is one of VRANGE_VOLATILE
> >   and VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so
> >   the kernel can no longer discard any pages in the range, while
> >   VRANGE_VOLATILE is a memory unpin operation, so the kernel can discard
> >   pages in the vrange at any time. At the moment, behavior is one of
> >   VRANGE_FULL and VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once
> >   it decides to discard a page in a vrange, it should discard all of the
> >   pages in the vrange selected as the victim. VRANGE_PARTIAL tells the
> >   kernel that it may discard only some of the pages in a vrange.
> >   VRANGE_PARTIAL handling is not implemented yet.
> >
> > - What happens if the user accesses a page (i.e., a virtual address)
> >   discarded by the kernel?
> >
> >   The user can encounter SIGBUS.
> >
> > - What should the user do to avoid SIGBUS?
> >
> >   Call vrange(addr, length, VRANGE_NOVOLATILE, behavior) before
> >   accessing a range that was previously marked with
> >   vrange(addr, length, VRANGE_VOLATILE, behavior).
> >
> > - What happens if the user accesses a page (i.e., a virtual address) that
> >   the kernel has not discarded?
> >
> >   The user sees the valid data that was there before calling
> >   vrange(., VRANGE_VOLATILE), without a page fault.
> >
> > - How does it differ from madvise(DONTNEED)?
> >
> >   System call semantics
> >
> >   DONTNEED makes sure the user always sees zero-fill pages after
> >   calling madvise, while with vrange the user can see the old data or
> >   encounter SIGBUS.
> >
> >   Internal implementation
> >
> >   madvise(DONTNEED) has to zap all mapped pages in the range, so its
> >   overhead increases linearly with the number of mapped pages.
> >   Moreover, if the user then writes to the zapped pages, a page fault +
> >   page allocation + memset has to happen.
> >
> >   vrange just registers an address range instead of zapping all of the
> >   ptes in the vma, so it doesn't touch the ptes at all.
> >
> > - What's the benefit compared to DONTNEED?
> >
> >   1. The system call overhead is smaller because vrange just registers
> >      a range in an interval tree instead of zapping all the pages in the
> >      range, so the call should be really cheap.
> >
> >   2. It has a chance to eliminate overheads (e.g., zapping ptes + page fault
> >      + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
> >      severe.
> >
> >   3. It has the potential to zap all ptes and free the pages if memory
> >      pressure is severe, so the discard scanning overhead could be
> >      smaller - TODO
> >
> > - What are the target users?
> >
> >   Firstly, user-space allocators like ptmalloc and jemalloc, or heap
> >   management of virtual machines like Dalvik. It also comes in handy for
> >   embedded systems that have no swap device and so can't reclaim anonymous
> >   pages. By discarding instead of swapping out, it can be used on systems
> >   without swap.
> 
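
For reference, the madvise(DONTNEED) semantics quoted above can be
checked from user space. The following is only a minimal sketch, assuming
Linux and a private anonymous mapping; the helper name
dontneed_zero_fills() is made up for illustration and is not part of the
patchset. It dirties a mapping, calls madvise(MADV_DONTNEED), and checks
that the pages read back as zero. It does not exercise vrange itself,
since that system call exists only in this RFC.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (name chosen for this sketch only).
 * Returns 1 if every byte of a private anonymous mapping reads back as
 * zero after madvise(MADV_DONTNEED), 0 if not, -1 on a system call error.
 */
int dontneed_zero_fills(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fault the pages in and dirty them. */
    memset(p, 0xAA, len);

    /* Zap the mapped pages; the old contents are dropped. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The first touch of each page faults in a fresh zero-filled page. */
    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

Calling dontneed_zero_fills(4 * 4096) is expected to return 1 on Linux.
The zap inside madvise() and the fault on re-touch are the costs that
the vrange approach described above tries to avoid when memory pressure
is low.
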
> I think that another potentially useful use-case would be using this
> -- or a similar API -- to opportunistically return deep user stack
> frames.
> 
> This is another place where we strongly care about the time-to-free as
> well as the time-to-reallocate in the case of relatively immediate
> re-use.

Indeed. Great idea!
Thanks, Paul.

-- 
Kind regards,
Minchan Kim