linux-kernel - Re: [PATCH] add extra free kbytes tunable

Message-ID: <51307354.5000401@gmail.com>
Date:	Fri, 01 Mar 2013 17:22:28 +0800
From:	Simon Jeons <simon.jeons@...il.com>
To:	Johannes Weiner <hannes@...xchg.org>
CC:	dormando <dormando@...ia.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Seiji Aguchi <seiji.aguchi@....com>,
	Satoru Moriya <satoru.moriya@....com>,
	Randy Dunlap <rdunlap@...otime.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"lwoodman@...hat.com" <lwoodman@...hat.com>,
	"hughd@...gle.com" <hughd@...gle.com>, Mel Gorman <mel@....ul.ie>
Subject: Re: [PATCH] add extra free kbytes tunable

Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>> The problem is that adding this tunable will constrain future VM
>>> implementations.  We will forever need to at least retain the
>>> pseudo-file.  We will also need to make some effort to retain its
>>> behaviour.
>>>
>>> It would of course be better to fix things so you don't need to tweak
>>> VM internals to get acceptable behaviour.
>> I sympathize with this. It's presently all that keeps us afloat though.
>> I'll whine about it again later if nothing else pans out.
>>
>>> You said:
>>>
>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>> : (used by page cache), scattered but frequent random io reads from 12+
>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>> : in a few different ways.
>>> :
>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>> : thousand).
>>> :
>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>> : (!) while pausing the system (Argh).
>>> :
>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>> : kernel to allocate massive amounts of memory for retransmission
>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>> : often a ton) of new allocations from an application. This paired with 2)
>>> : can cause the box to stall for 15+ seconds.
>>>
>>> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
>>> go off and free 40G of pagecache.  Do you know what's actually in that
>>> pagecache?  Large number of small files or small number of (very) large
>>> files?
>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>> accessed via address. Occasionally madvise (WILLNEED) is applied to the
>> address ranges before attempting to use them. There's a mix of other
>> files but nothing significant. The mmaps are READONLY and writes are done
>> via pwrite-ish functions.
>>
>> I could use some guidance on inspecting/tracing the problem. I've been
>> trying to reproduce it in a lab, and with respect to issue 2) I've found:
>>
>> - The amount of memory freed back up is either a percentage of total
>> memory or a percentage of free memory. (a machine with 48G of ram will
>> "only" free up an extra 4-7g)
>>
>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>> is applied with the application down. As it fills it seems to get itself
>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>> still apply to a stable instance.
>>
>> - Protecting the DMA32 zone with something like "1 1 32" into
>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>
>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>> hundred thousand pages before finding anything it actually wants to
>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>> start. It can take up to 3 seconds before kswapd starts actually
>> reclaiming pages.
>>
>> - So far as I can tell we're almost exclusively using 0 order allocations.
>> THP is disabled.
>>
>> There's not much dirty memory involved. It's not flushing out writes while
>> reclaiming, it just kills off massive amounts of cached memory.
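As an aside on the "low vmeff" observation above: sar -B reports %vmeff as
pages stolen divided by pages scanned over the sampling interval. The sketch
below shows that arithmetic on vmstat-style text. It is only an illustration:
the helper names vmstat_counter() and vmeff_percent() are hypothetical, and
the counter name used in the usage note (pgscan_kswapd) is an assumption,
since the exact names in /proc/vmstat vary between kernel versions (older
kernels split them per zone).

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a "name value" line in vmstat-style text (one counter per line).
 * Returns 0 and stores the value in *out on success, -1 if the counter is
 * missing or malformed. Hypothetical helper, not part of any library.
 */
static int vmstat_counter(const char *text, const char *name,
                          unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require a space after the name so "foo" does not match "foo_bar". */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * Reclaim efficiency in percent: pages stolen per pages scanned.
 * Returns -1.0 when nothing was scanned, so callers can tell "no reclaim
 * activity" apart from "0% efficiency".
 */
static double vmeff_percent(unsigned long long scanned,
                            unsigned long long stolen)
{
    if (scanned == 0)
        return -1.0;
    return 100.0 * (double)stolen / (double)scanned;
}
```

For a real measurement one would read /proc/vmstat twice, take the
difference of the scan and steal counters between the two samples, and pass
the deltas to vmeff_percent(). A kswapd that scans 200000 pages and reclaims
500 of them comes out at 0.25%.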
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.

It seems that only VM_EXEC mapped file pages are protected.
I also have a question about the page reclaim subsystem:
static inline int page_is_file_cache(struct page *page)
{
     return !PageSwapBacked(page);
}
AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
cache, and cleared when it is removed from the swap cache. So anonymous
pages which are reclaimed and added to the swap cache won't have this
flag, and will then be treated as file-backed pages? Is that a bug? Also,
in __add_to_swap_cache(), a successful insertion into the radix tree
increments NR_FILE_PAGES. Why?
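
To make the premise of the question concrete, here is a tiny userspace model
of the predicate. It is not kernel code: struct model_page, the bit position
and all model_* names are hypothetical. It only shows that the file/anon
classification depends on a single flag bit, and it says nothing about when
the kernel actually sets or clears that bit.

```c
#include <stdbool.h>

/* Hypothetical bit position, chosen for illustration only. */
#define MODEL_PG_SWAPBACKED (1UL << 0)

/* Minimal stand-in for struct page: just a flags word. */
struct model_page {
    unsigned long flags;
};

/* Mirrors the role of PageSwapBacked(): test the swap-backed bit. */
static bool model_page_swap_backed(const struct model_page *page)
{
    return (page->flags & MODEL_PG_SWAPBACKED) != 0;
}

/*
 * Mirrors the logic of page_is_file_cache() quoted above: any page
 * without the swap-backed bit is classified as file cache.
 */
static bool model_page_is_file_cache(const struct model_page *page)
{
    return !model_page_swap_backed(page);
}
```

In this model, a page whose flags word is 0 is classified as file cache, and
setting MODEL_PG_SWAPBACKED flips it to the anonymous side. Whether an
anonymous page can ever be seen without the bit is exactly what I am asking.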
>
> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
>
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with

Why does kswapd not make progress for some time at first?

> increasing scan priority.  And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
>
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim.  It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark.  We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4883eb..8a4c446 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   		.may_unmap = 1,
>   		.may_swap = 1,
>   		/*
> -		 * kswapd doesn't want to be bailed out while reclaim. because
> -		 * we want to put equal scanning pressure on each zone.
> +		 * Even kswapd zone scans want to be bailed out after
> +		 * reclaiming a good chunk of pages.  It will just
> +		 * come back if the watermarks are still not met.
>   		 */
> -		.nr_to_reclaim = ULONG_MAX,
> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>   		.order = order,
>   		.target_mem_cgroup = NULL,
>   	};
>
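
To illustrate the scan-window argument and the effect of the cap in the
patch above, here is a small userspace model. It is a sketch under stated
assumptions, not the kernel's implementation: the window is modelled as
lru_pages >> priority, every scanned page is assumed reclaimable (the worst
case once all pages have been cycled), and DEF_PRIORITY = 12 and
SWAP_CLUSTER_MAX = 32 are the values I believe the kernel uses. The
functions scan_window() and reclaim_pass() are hypothetical names.

```c
#include <limits.h>

#define DEF_PRIORITY     12    /* assumed starting scan priority */
#define SWAP_CLUSTER_MAX 32UL  /* assumed reclaim batch size */

/* Pages covered by one scan pass at the given priority (0 = whole list). */
static unsigned long scan_window(unsigned long lru_pages, int priority)
{
    return lru_pages >> priority;
}

/*
 * Model one zone scan in which every scanned page can be reclaimed.
 * Scanning proceeds in SWAP_CLUSTER_MAX batches and stops early once
 * nr_to_reclaim pages have been freed. With nr_to_reclaim = ULONG_MAX the
 * early exit never triggers, so the whole window is reclaimed.
 */
static unsigned long reclaim_pass(unsigned long window,
                                  unsigned long nr_to_reclaim)
{
    unsigned long reclaimed = 0;

    while (window > 0 && reclaimed < nr_to_reclaim) {
        unsigned long batch =
            window < SWAP_CLUSTER_MAX ? window : SWAP_CLUSTER_MAX;

        window -= batch;
        reclaimed += batch;
    }
    return reclaimed;
}
```

With 12582912 pages on the list (48G of 4K pages), scan_window() gives 3072
pages at priority 12 and the entire list at priority 0. Passing ULONG_MAX as
nr_to_reclaim makes reclaim_pass() free the whole window, while passing
SWAP_CLUSTER_MAX stops it after 32 pages, after which the caller can check
the watermarks again.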
