linux-kernel - Re: missing madvise functionality

Date:	Wed, 04 Apr 2007 17:46:12 +1000
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Ulrich Drepper <drepper@...hat.com>
CC:	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Jakub Jelinek <jakub@...hat.com>,
	Linux Memory Management <linux-mm@...ck.org>
Subject: Re: missing madvise functionality

Ulrich Drepper wrote:
> People might remember the thread about mysql not scaling and pointing
> the finger quite happily at glibc.  Well, the situation is not like that.
> 
> The problem is glibc has to work around kernel limitations.  If the
> malloc implementation detects that a large chunk of previously allocated
> memory is now free and unused it wants to return the memory to the
> system.  What we currently have to do is this:
> 
>   to free:      mmap(PROT_NONE) over the area
>   to reuse:     mprotect(PROT_READ|PROT_WRITE)
> 
> Yep, that's expensive, both operations need to get locks preventing
> other threads from doing the same.
> 
> Some people were quick to suggest that we simply avoid the freeing in
> many situations (that's what the patch submitted by Yanmin Zhang
> basically does).  That's no solution.  One of the very good properties
> of the current allocator is that it does not use much memory.

Does mmap(PROT_NONE) actually free the memory?
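
For reference, the free/reuse scheme quoted above can be sketched roughly as
follows. This is a minimal sketch, not glibc's actual code: the helper names
arena_release() and arena_reuse() are hypothetical, and the sketch assumes the
"mmap(PROT_NONE) over the area" step is done with MAP_FIXED on a private
anonymous mapping. Under that assumption the old pages in the range are
dropped, because a MAP_FIXED mapping replaces whatever was mapped there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: "free" a page-aligned range by mapping a fresh
 * PROT_NONE anonymous region over it.  MAP_FIXED replaces the existing
 * mapping, so the old page contents are discarded. */
static int arena_release(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    void *p = mmap(addr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

/* Hypothetical helper: make the range usable again.  The first touch of
 * each page afterwards faults in a new zero-filled page. */
static int arena_reuse(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}
```

Both calls modify the address space layout, which is where the locking cost
mentioned in the quoted text comes from.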


> A solution for this problem is a madvise() operation with the following
> property:
> 
>   - the content of the address range can be discarded
> 
>   - if an access to a page in the range happens in the future it must
>     succeed.  The old page content can be provided or a new, empty page
>     can be provided
> 
> That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
> pages, causing *all* future reuses to create page faults.  This is what
> I guess happens in the mysql test case where the pages were unused and
> freed but then almost immediately reused.  The page faults erased all
> the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
> calls.

Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a page fault
than a syscall (including, of course, the cost of the TLB fill for the memory
access after the syscall)?

Zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.
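
To make the comparison concrete, here is a minimal sketch of the
MADV_DONTNEED path on a private anonymous mapping. The helper names
range_discard() and touch_and_count_faults() are hypothetical, and the fault
counter relies on ru_minflt from getrusage(), which is only a rough,
process-wide measure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: zap the pages in a page-aligned range.  The
 * mapping itself stays in place, so no mprotect() is needed to reuse it. */
static int range_discard(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}

/* Hypothetical helper: write one byte per page and return how many minor
 * faults the process took while doing so, or -1 on error. */
static long touch_and_count_faults(volatile unsigned char *p, size_t len)
{
    struct rusage before, after;
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 1;
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_minflt - before.ru_minflt;
}
```

Calling touch_and_count_faults() on a range before and after
range_discard() gives a rough idea of how many faults the reuse costs, which
can then be weighed against the cost of the mmap()/mprotect() pair.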

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).

-- 
SUSE Labs, Novell Inc.

View attachment "madv-mmap_sem.patch" of type "text/plain" (1306 bytes)