linux-kernel - Re: [PATCH] add extra free kbytes tunable

Message-ID: <51307583.2020006@gmail.com>
Date:	Fri, 01 Mar 2013 17:31:47 +0800
From:	Simon Jeons <simon.jeons@...il.com>
To:	Johannes Weiner <hannes@...xchg.org>
CC:	dormando <dormando@...ia.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Seiji Aguchi <seiji.aguchi@....com>,
	Satoru Moriya <satoru.moriya@....com>,
	Randy Dunlap <rdunlap@...otime.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"lwoodman@...hat.com" <lwoodman@...hat.com>,
	"hughd@...gle.com" <hughd@...gle.com>, Mel Gorman <mel@....ul.ie>
Subject: Re: [PATCH] add extra free kbytes tunable

On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>>> The problem is that adding this tunable will constrain future VM
>>>> implementations.  We will forever need to at least retain the
>>>> pseudo-file.  We will also need to make some effort to retain its
>>>> behaviour.
>>>>
>>>> It would of course be better to fix things so you don't need to tweak
>>>> VM internals to get acceptable behaviour.
>>> I sympathize with this. It's presently all that keeps us afloat though.
>>> I'll whine about it again later if nothing else pans out.
>>>
>>>> You said:
>>>>
>>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>>> : (used by page cache), scattered but frequent random io reads from 12+
>>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>>> : in a few different ways.
>>>> :
>>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>>> : thousand).
>>>> :
>>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>>> : (!) while pausing the system (Argh).
>>>> :
>>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>>> : kernel to allocate massive amounts of memory for retransmission
>>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>>> : often a ton) of new allocations from an application. This paired with 2)
>>>> : can cause the box to stall for 15+ seconds.
>>>>
>>>> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
>>>> go off and free 40G of pagecache.  Do you know what's actually in that
>>>> pagecache?  Large number of small files or small number of (very) large
>>>> files?
>>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>>> accessed via address. occasionally madvise (WILLNEED) applied to the
>>> address ranges before attempting to use them. There're a mix of other
>>> files but nothing significant. The mmap's are READONLY and writes are done
>>> via pwrite-ish functions.
>>>
>>> I could use some guidance on inspecting/tracing the problem. I've been
>>> trying to reproduce it in a lab, and with respect to issue 2) I've found:
>>>
>>> - The amount of memory freed back up is either a percentage of total
>>> memory or a percentage of free memory. (a machine with 48G of ram will
>>> "only" free up an extra 4-7g)
>>>
>>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>>> is applied with the application down. As it fills it seems to get itself
>>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>>> still apply to a stable instance.
>>>
>>> - Protecting the DMA32 zone with something like "1 1 32" into
>>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>>
>>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>>> hundred thousand pages before finding anything it actually wants to
>>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>>> start. It can take up to 3 seconds before kswapd starts actually
>>> reclaiming pages.
>>>
>>> - So far as I can tell we're almost exclusively using 0 order allocations.
>>> THP is disabled.
>>>
>>> There's not much dirty memory involved. It's not flushing out writes while
>>> reclaiming, it just kills off massive amount of cached memory.
>> Mapped file pages have to get scanned twice before they are reclaimed
>> because we don't have enough usage information after the first scan.
>
> It seems that just VM_EXEC mapped file pages are protected.
> Issue in page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
>     return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
> cache, and cleared when it is removed from the swap cache. So anonymous
> pages which are reclaimed and added to the swap cache won't have this
> flag, then they will be treated as

s/are/aren't

> file-backed pages?  Is that a bug? In __add_to_swap_cache(), a successful
> insertion into the radix tree increments NR_FILE_PAGES;
> why?
>>
>> In your case, when you start this workload after a fresh boot or
>> dropping the caches, there will be 48G of mapped file pages that have
>> never been scanned before and that need to be looked at twice.
>>
>> Unfortunately, if kswapd does not make progress (and it won't for some
>> time at first), it will scan more and more aggressively with
>
> Why does kswapd not make progress for some time at first?
>
>> increasing scan priority.  And when the 48G of pages are finally
>> cycled, kswapd's scan window is a large percentage of your machine's
>> memory, and it will free every single page in it.
>>
>> I think we should think about capping kswapd zone reclaim cycles just
>> as we do for direct reclaim.  It's a little ridiculous that it can run
>> unbounded and reclaim every page in a zone without ever checking back
>> against the watermark.  We still increase the scan window evenly when
>> we don't make forward progress, but we are more carefully inching zone
>> levels back toward the watermarks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c4883eb..8a4c446 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>           .may_unmap = 1,
>>           .may_swap = 1,
>>           /*
>> -         * kswapd doesn't want to be bailed out while reclaim. because
>> -         * we want to put equal scanning pressure on each zone.
>> +         * Even kswapd zone scans want to be bailed out after
>> +         * reclaiming a good chunk of pages.  It will just
>> +         * come back if the watermarks are still not met.
>>            */
>> -        .nr_to_reclaim = ULONG_MAX,
>> +        .nr_to_reclaim = SWAP_CLUSTER_MAX,
>>           .order = order,
>>           .target_mem_cgroup = NULL,
>>       };
>>
>
