Message-ID: <4A5B7CC5.4030908@redhat.com>
Date: Mon, 13 Jul 2009 21:28:21 +0300
From: Izik Eidus <ieidus@...hat.com>
To: Hugh Dickins <hugh.dickins@...cali.co.uk>
CC: Andrea Arcangeli <aarcange@...hat.com>,
Rik van Riel <riel@...hat.com>,
Chris Wright <chrisw@...hat.com>,
Nick Piggin <nickpiggin@...oo.com.au>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: KSM: current madvise rollup
Hugh Dickins wrote:
> On Sun, 12 Jul 2009, Izik Eidus wrote:
>
>> On Sat, 11 Jul 2009 20:22:11 +0100 (BST)
>> Hugh Dickins <hugh.dickins@...cali.co.uk> wrote:
>>
>>> We may want to do that anyway. It concerned me a lot when I was
>>> first testing (and often saw kernel_pages_allocated greater than
>>> pages_shared - probably because of the original KSM's eagerness to
>>> merge forked pages, though I think there may have been more to it
>>> than that). But seems much less of an issue now (that ratio is much
>>> healthier), and even less of an issue once KSM pages can be swapped.
>>> So I'm not bothering about it at the moment, but it may make sense.
>>>
>
> I realized since writing that with the current statistics you really
> cannot tell how big an issue the orphaned (count 1) KSM pages are -
> good sharing of a few will completely hide non-sharing of many.
>
> But I've hacked in more stats (not something I'd care to share yet!),
> and those confirm that for my loads at least, the orphaned KSM pages
> are few compared with the shared ones.
>
>
>> We could add a patch like the one below, but I think we should leave it
>> as it is now,
>>
>
> I agree we should leave it as is for now. My guess is that we'll
> prefer to leave them around, until approaching max_kernel_pages_alloc,
> pruning them only at that stage (rather as we free swap more aggressively
> when it's 50% full). There may be benefit in not removing them too soon,
> there may be benefit in holding on to stable pages for longer (holding a
> reference in the stable tree for a while). Or maybe not, just an idea.
>
Well, I personally don't see any real solution to this problem without
real swapping support, so for now I don't mind not dealing with it at all
(as long as the admin can decide how many unswappable pages to allow).
But if you have a stronger opinion than me on this case, I don't mind
changing it.
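
If we ever do want what you describe, I imagine it would look roughly
like the sketch below - purely illustrative, just reusing the
kernel_pages_allocated / max_kernel_pages_alloc names from above, and the
50% threshold is only borrowed from your swap analogy:

	/*
	 * Sketch only: leave orphaned (count 1) KSM pages alone until we
	 * approach the allocation limit, and only then let the scan start
	 * pruning them.
	 */
	static int ksm_should_prune_orphans(void)
	{
		return kernel_pages_allocated > max_kernel_pages_alloc / 2;
	}

and then the break_cow() branch in the patch below would only be taken
when that returns true.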
>
>> and solve it all (as you have said) with KSM page swapping
>> support in the next kernel release.
>> (Right now ksm can limit itself with max_kernel_pages_alloc)
>>
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index a0fbdb2..ee80861 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1261,8 +1261,13 @@ static void ksm_do_scan(unsigned int scan_npages)
>> rmap_item = scan_get_next_rmap_item(&page);
>> if (!rmap_item)
>> return;
>> - if (!PageKsm(page) || !in_stable_tree(rmap_item))
>> + if (!PageKsm(page) || !in_stable_tree(rmap_item)) {
>> cmp_and_merge_page(page, rmap_item);
>> + } else if (page_mapcount(page) == 0) {
>>
>
> If we did that (but we agree not for now), shouldn't it be
> page_mapcount(page) == 1
> ? The mapcount 0 ones already got freed by the zap/unmap code.
>
Yeah, I don't know what I was thinking when I compared it to zero...
1 is indeed the right value here.
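
I.e. the branch should read (same untested sketch as before, just with
the mapcount check fixed):

	if (!PageKsm(page) || !in_stable_tree(rmap_item)) {
		cmp_and_merge_page(page, rmap_item);
	} else if (page_mapcount(page) == 1) {
		/* orphaned: only our own mapping of the KSM page is left */
		break_cow(rmap_item->mm, rmap_item->address & PAGE_MASK);
		remove_rmap_item_from_tree(rmap_item);
	}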
>
>> + break_cow(rmap_item->mm,
>> + rmap_item->address & PAGE_MASK);
>>
>
> Just a note on that " & PAGE_MASK": it's unnecessary there and
> almost everywhere else. One of the pleasures of putting flags into
> the bottom bits of the address, in code concerned with faulting, is
> that the faulting address can be anywhere within the page, so we
> don't have to bother to mask off the flags.
>
>
>> + remove_rmap_item_from_tree(rmap_item);
>> + }
>> put_page(page);
>> }
>> }
>>
>>
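
And good point on the " & PAGE_MASK": since the flag bits live in the
low bits of rmap_item->address, they only move the faulting address
around inside the same page, so something like this (the flag name and
value here are just for illustration, the real ones may differ):

	/* illustration only - low bits of ->address double as flags */
	#define STABLE_FLAG	0x1

	/* either call faults in exactly the same page; the mask buys nothing */
	break_cow(rmap_item->mm, rmap_item->address & PAGE_MASK);
	break_cow(rmap_item->mm, rmap_item->address);
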
>>> Oh, something that might be making it higher, that I didn't highlight
>>> (and can revert if you like, it was just more straightforward this
>>> way): with scan_get_next_rmap skipping the non-present ptes,
>>> pages_to_scan is currently a limit on the _present_ pages scanned in
>>> one batch.
>>>
>> You mean that now when you say pages_to_scan = 512, it won't count the
>> non-present ptes as part of the counter?  So if we have 500 non-present
>> ptes at the beginning and then 512 present ptes later, before it used to
>> call cmp_and_merge_page() on only 12 pages, while now it will get called
>> on 512 pages?
>>
>
> If I understand you right, yes, before it would do those 500 absent then
> 512 present in two batches, first 512 (of which only 12 present) then 500;
> whereas now it'll skip the 500 absent without counting them, and handle
> the 512 present in that same one batch.
>
>
>> If so, then I like this change; it is more logical from a CPU
>> consumption point of view,
>>
>
> Yes, although it does spend a little time on the absent ones, it should
> be much less time than it spends comparing or checksumming on present ones.
>
Yeah, compared to the other stuff it is minor...
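
Just to spell out how I read the new accounting (rough sketch of the
ksm_do_scan() loop, not the exact code):

	while (scan_npages--) {		/* budget counts *present* pages only */
		rmap_item = scan_get_next_rmap_item(&page);
		if (!rmap_item)		/* absent ptes were skipped, not counted */
			return;
		if (!PageKsm(page) || !in_stable_tree(rmap_item))
			cmp_and_merge_page(page, rmap_item);
		put_page(page);
	}

so in the 500-absent / 512-present example above, all 512 present pages
get handled in the one batch.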
>
>> and in addition we have that cond_resched(),
>> so I don't see a problem with this.
>>
>
> Right, that cond_resched() is vital in this case.
>
> By the way, something else I didn't highlight, a significant benefit
> from avoiding get_user_pages(): that was doing a mark_page_accessed()
> on every present pte that it found, interfering with pageout decisions.
>
That is of great value, I hadn't even thought about that...
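
(If I understand the change, the lookup now ends up doing roughly the
following instead of going through get_user_pages() - FOLL_TOUCH being
what used to trigger the mark_page_accessed(); the exact flags are from
memory, so take it as a sketch:)

	/*
	 * Looking the page up directly, without FOLL_TOUCH, means
	 * follow_page() no longer does mark_page_accessed() on every pte
	 * the scan visits, so the scan stops fighting pageout decisions.
	 */
	page = follow_page(vma, addr, FOLL_GET);
	if (!page || IS_ERR(page))
		goto out;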
> Hugh
>