linux-kernel - Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages

Message-ID: <20120531151816.GA32252@localhost>
Date:	Thu, 31 May 2012 23:18:16 +0800
From:	Fengguang Wu <fengguang.wu@...el.com>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujtisu.com>,
	Mel Gorman <mgorman@...e.de>, Minchan Kim <minchan@...nel.org>,
	Rik van Riel <riel@...hat.com>,
	Ying Han <yinghan@...gle.com>,
	Greg Thelen <gthelen@...gle.com>,
	Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages

On Tue, May 29, 2012 at 03:51:01PM +0200, Michal Hocko wrote:
> On Tue 29-05-12 11:35:11, Johannes Weiner wrote:
> [...]
> >         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
> >                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > 
> > But the problem is the part declaring the zone congested:
> > 
> >         /*
> >          * Tag a zone as congested if all the dirty pages encountered were
> >          * backed by a congested BDI. In this case, reclaimers should just
> >          * back off and wait for congestion to clear because further reclaim
> >          * will encounter the same problem
> >          */
> >         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
> >                 zone_set_flag(mz->zone, ZONE_CONGESTED);
> > 
> > Note the global_reclaim().  It would be nice to have these two operate
> > against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
> > problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
> > is not necessarily involved in a memcg-constrained load, so we need to
> > find clearing sites that work for both global and memcg reclaim.
> 
> OK, I have tried it with a simpler approach:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c978ce4..e45cf2a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	 *                     isolated page is PageWriteback
>  	 */
>  	if (nr_writeback && nr_writeback >=
> -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
> +		if (global_reclaim(sc))
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +		else
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
>  
>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),
> 
> without 'lruvec-zone' congestion flag and it worked reasonably well, for
> my testcase at least (no OOM). We still could stall even if we managed
> to writeback pages in the meantime but we should at least prevent from
> the problem you are mentioning (most of the time).
> 
> The issue with pagevec zone tagging is, as you mentioned, that the
> flag clearing places are not that easy to get right because we do
> not have anything like zone_watermark_ok in a memcg context. I am even
> thinking whether it is possible without per-memcg dirty accounting.
> 
> To be honest, I was considering congestion waiting at the beginning as
> well but I hate using an arbitrary timeout when we are, in fact, waiting
> for a specific event.
> Nevertheless I do acknowledge your concern with accidentally reclaiming
> pages in the middle of the LRU because of clean page cache, which would
> lead to unnecessary stalls.

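For reference, the throttling check in the hunk quoted above can be
modelled in userspace. This is only a sketch: should_stall() is a
hypothetical helper name, and DEF_PRIORITY is assumed to be 12 as in
mainline. It shows that the share of isolated pages that must be under
writeback halves each time the scan priority drops by one.

```c
/* Assumption: DEF_PRIORITY is 12, as in mainline include/linux/mmzone.h. */
#define DEF_PRIORITY 12

/*
 * Hypothetical userspace model of the check in shrink_inactive_list():
 * stall only if at least nr_taken >> (DEF_PRIORITY - priority) of the
 * isolated pages were found under writeback.
 *
 * priority must be in the range 0..DEF_PRIORITY.
 */
static int should_stall(unsigned long nr_writeback, unsigned long nr_taken,
			int priority)
{
	unsigned long threshold = nr_taken >> (DEF_PRIORITY - priority);

	/* nr_writeback == 0 never stalls, even when the threshold is 0. */
	return nr_writeback && nr_writeback >= threshold;
}
```

With nr_taken = 32, all 32 pages have to be under writeback at
DEF_PRIORITY, 8 of them at priority 10, and a single one at priority 0.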
Hi Michal,

The only remaining concern is whether the patch will hurt interactive
performance when there are not so many dirty pages in the memcg.

For example, run one dd writing to disk plus several other dd's reading
from either disk or a sparse file.

There is no dirty accounting for memcg. However, if you run the
workloads in one single 100MB memcg, the global dirty page count in
/proc/vmstat will be exactly the number of dirty pages inside that
memcg. Thus we can create situations with e.g. 10%, 30%, 50% dirty
pages inside the memcg and watch how well your patch performs.
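
A minimal sketch of how the dirty share could be computed from a
/proc/vmstat snapshot, assuming the memcg is the only source of dirty
pages. parse_nr_dirty() and dirty_percent() are hypothetical helper
names, and the page size and memcg limit are supplied by the caller:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value of the "nr_dirty" line in /proc/vmstat-style text,
 * or -1 if there is no such line. Lines such as "nr_dirty_threshold"
 * are skipped because no number directly follows the "nr_dirty" prefix.
 */
static long parse_nr_dirty(const char *vmstat_text)
{
	const char *line = vmstat_text;

	while (line && *line) {
		long value;

		if (sscanf(line, "nr_dirty %ld", &value) == 1)
			return value;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Dirty pages as a percentage of the memcg limit (both in bytes). */
static double dirty_percent(long nr_dirty_pages, long page_size,
			    long memcg_limit_bytes)
{
	assert(memcg_limit_bytes > 0);
	return 100.0 * (double)nr_dirty_pages * (double)page_size /
	       (double)memcg_limit_bytes;
}
```

For instance, 2560 dirty pages of 4096 bytes in a 100MB memcg is the
10% case. The text passed to parse_nr_dirty() would come from reading
/proc/vmstat while the dd's are running.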

I happen to have a debug patch for showing the number of page reclaim
stalls.  It applies cleanly to 3.4, and you'll need to add accounting
to your new code. If it shows low stall numbers in the cases of 10-30%
dirty pages even if they are quickly rotated due to fast reads, we may
go ahead with any approach :-)

Thanks,
Fengguang

View attachment "mm-debugfs-vmscan-stalls-0.patch" of type "text/x-diff" (4544 bytes)