Date:	Wed, 21 Apr 2010 09:35:34 +0200
From:	Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To:	Rik van Riel <riel@...hat.com>
CC:	Johannes Weiner <hannes@...xchg.org>, Mel Gorman <mel@....ul.ie>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	Nick Piggin <npiggin@...e.de>,
	Chris Mason <chris.mason@...cle.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	linux-kernel@...r.kernel.org, gregkh@...ell.com,
	Corrado Zoccolo <czoccolo@...il.com>
Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure



Christian Ehrhardt wrote:
> 
> 
> Rik van Riel wrote:
>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>
>>> The idea is that it pans out on its own.  If the workload changes, new
>>> pages get activated and when that set grows too large, we start 
>>> shrinking
>>> it again.
>>>
>>> Of course, right now this unscanned set is way too large and we can end
>>> up wasting up to 50% of usable page cache on false active pages.
>>
>> Thing is, changing workloads often change back.
>>
>> Specifically, think of a desktop system that is doing
>> work for the user during the day and gets backed up
>> at night.
>>
>> You do not want the backup to kick the working set
>> out of memory, because when the user returns in the
>> morning the desktop should come back quickly after
>> the screensaver is unlocked.
> 
> IMHO it is also fine to prevent that nightly backup job from ending up 
> unfinished when the user arrives in the morning because we didn't give 
> it some more cache - and e.g. a 30 sec transition from/to both 
> optimized states is fine.
> But eventually I guess the point is that both behaviors are reasonable 
> to achieve - depending on the user's needs.
> 
> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active 
> file pages but only deactivate them when the ratio is off and otherwise 
> strip buffers of clean pages"
> c) I would extend the patch from Johannes setting the ratio of 
> active/inactive pages to be a userspace tunable

A first revision of patch c is attached.
I tested assigning different percentages; so far e.g. 50 really behaves 
like before and 25 protects ~42M of Buffers in my example, which matches 
the intended behavior - see the patch for more details.

Checkpatch and some basic function tests went fine.
While it may not be perfect yet, I think it is ready for feedback now.
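
For illustration, here is a stripped-down sketch of the kind of check the 
tunable ends up controlling (this is not the attached patch; the sysctl 
name, the default and the exact formula are just assumptions to show the 
idea, based on the current inactive_file_is_low_global() check):

	/* assumed sysctl, e.g. /proc/sys/vm/active_file_ratio */
	int sysctl_active_file_ratio = 50;	/* 50 == old behavior */

	static int inactive_file_is_low_global(struct zone *zone)
	{
		unsigned long active, inactive;

		active   = zone_page_state(zone, NR_ACTIVE_FILE);
		inactive = zone_page_state(zone, NR_INACTIVE_FILE);

		/*
		 * Consider the inactive list low once the active file
		 * pages exceed the configured share of all file pages.
		 * With ratio == 50 this is the old "active > inactive".
		 */
		return active * 100 > (active + inactive) *
					sysctl_active_file_ratio;
	}

(the ctl_table entry that exposes the knob is omitted here)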

> a, b and a+b would then need to be tested to see whether they achieve 
> a better behavior.
> 
> c on the other hand would be a fine tunable to let administrators 
> (knowing their workloads) or distributions (e.g. different values for 
> Desktop/Server defaults) adapt their installations.
> 
> In theory a,b and c should work fine together in case we need all of them.
> 
>> The big question is, what workload suffers from
>> having the inactive list at 50% of the page cache?
>>
>> So far the only big problem we have seen is on a
>> very unbalanced virtual machine, with 256MB RAM
>> and 4 fast disks.  The disks simply have more IO
>> in flight at once than what fits in the inactive
>> list.
> 
> Did I get you right that this means the write case - which would 
> explain why it builds up buffers to the 50% max?
> 

Thinking about it, I wondered what these Buffers are protected for.
If the intention of saving these buffers is reuse with similar loads, I 
wonder why I "need" three iozone runs to build up the 85M in my case.

Buffers start at ~0; after iozone run #1 they are at ~35M, after run #2 
at ~65M, and after run #3 at ~85M.
Shouldn't the first run either allocate the 85M directly, in case that 
much is needed for a single run - or, if not, shouldn't the second and 
third runs just "reuse" the ~35M of Buffers still held from the first run?

Note - "1 iozone run" means "iozone ... -i 0", which sequentially writes 
and then rewrites a 2GB file on 16 disks in my current case.

I'm looking forward especially to patch b, as I'd really like to see a 
kernel able to win back these buffers if they are not used for a longer 
period, while still allowing them to grow and be protected while needed.
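
To spell out the behavior I would hope for from patch b, this is roughly 
what I have in mind (pure pseudo-code sketch, not Johannes' code - the 
two helpers are made up and only stand for the respective checks):

	/*
	 * Always scan the active file list; deactivate pages only
	 * when the active/inactive ratio is off, otherwise just strip
	 * the buffer heads of clean pages so their memory can be
	 * reclaimed without losing the activation.
	 */
	if (active_file_ratio_is_off(zone)) {
		/* ratio off: deactivate as we do today */
		move_to_inactive_list(zone, page);
	} else if (!PageDirty(page) && page_has_private(page)) {
		/* ratio fine: only release clean pages' buffers */
		try_to_release_page(page, 0);
	}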

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

View attachment "active-inacte-ratio-tunable.diff" of type "text/x-patch" (4673 bytes)
