Message-ID: <20110301044455.GB2107@barrios-desktop>
Date: Tue, 1 Mar 2011 13:44:55 +0900
From: Minchan Kim <minchan.kim@...il.com>
To: Greg Thelen <gthelen@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
containers@...ts.osdl.org, Andrea Righi <arighi@...eler.com>,
Balbir Singh <balbir@...ux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
Ciju Rajan K <ciju@...ux.vnet.ibm.com>,
David Rientjes <rientjes@...gle.com>,
Wu Fengguang <fengguang.wu@...el.com>,
Chad Talbott <ctalbott@...gle.com>,
Justin TerAvest <teravest@...gle.com>,
Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [PATCH v5 9/9] memcg: check memcg dirty limits in page
writeback
On Fri, Feb 25, 2011 at 01:36:00PM -0800, Greg Thelen wrote:
> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits as well as
> the system-wide limits. This allows different cgroups to have distinct
> dirty limits which trigger direct and background writeback at different
> levels.
>
> If called with a mem_cgroup, then throttle_vm_writeout() should query
> the given cgroup for its dirty memory usage limits.
>
> Signed-off-by: Andrea Righi <arighi@...eler.com>
> Signed-off-by: Greg Thelen <gthelen@...gle.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> Acked-by: Wu Fengguang <fengguang.wu@...el.com>
> ---
<snip>
> /*
> @@ -477,12 +502,14 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> * data. It looks at the number of dirty pages in the machine and will force
> * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> * If we're over `background_thresh' then the writeback threads are woken to
> - * perform some writeout.
> + * perform some writeout. The current task may have per-memcg dirty
> + * limits, which are also checked.
> */
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> struct dirty_info sys_info;
> + struct dirty_info memcg_info;
> long bdi_nr_reclaimable;
> long bdi_nr_writeback;
> unsigned long bdi_thresh;
> @@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
> };
>
> global_dirty_info(&sys_info);
> + if (!memcg_dirty_info(NULL, &memcg_info))
> + memcg_info = sys_info;
Sigh.
I don't like the double check in the no-memcg configuration or root memcg case.
By "double check" I mean 1) the sys_info check and 2) the memcg_info check; in
those cases memcg_info is just a copy of sys_info, so the second check is
redundant.
In addition, we always have to duplicate the same logic for global and memcg,
which adds binary bloat.
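Something like this (completely untested, just to sketch the idea for this
first check; the memcg_limited local is made up, and the later memcg_info
users in the loop would need the same treatment) would avoid comparing
sys_info against itself:

		bool memcg_limited;

		global_dirty_info(&sys_info);
		memcg_limited = memcg_dirty_info(NULL, &memcg_info);

		/*
		 * When the task has no separate memcg limits, memcg_info
		 * would only be a copy of sys_info, so skip the second
		 * comparison entirely instead of repeating it.
		 */
		if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
		    (sys_info.background_thresh + sys_info.dirty_thresh) / 2 &&
		    (!memcg_limited ||
		     dirty_info_reclaimable(&memcg_info) +
		     memcg_info.nr_writeback <=
		     (memcg_info.background_thresh +
		      memcg_info.dirty_thresh) / 2))
			break;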
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
> + if ((dirty_info_reclaimable(&sys_info) +
> + sys_info.nr_writeback <=
> (sys_info.background_thresh +
> - sys_info.dirty_thresh) / 2)
> + sys_info.dirty_thresh) / 2) &&
> + (dirty_info_reclaimable(&memcg_info) +
> + memcg_info.nr_writeback <=
> + (memcg_info.background_thresh +
> + memcg_info.dirty_thresh) / 2))
> break;
>
> - bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
> + bdi_thresh = bdi_dirty_limit(bdi,
> + min(sys_info.dirty_thresh,
> + memcg_info.dirty_thresh));
> bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
> /*
> @@ -541,7 +577,9 @@ static void balance_dirty_pages(struct address_space *mapping,
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> || (dirty_info_reclaimable(&sys_info) +
> - sys_info.nr_writeback > sys_info.dirty_thresh);
> + sys_info.nr_writeback > sys_info.dirty_thresh)
> + || (dirty_info_reclaimable(&memcg_info) +
> + memcg_info.nr_writeback > memcg_info.dirty_thresh);
>
> if (!dirty_exceeded)
> break;
> @@ -559,7 +597,9 @@ static void balance_dirty_pages(struct address_space *mapping,
> * up.
> */
> trace_wbc_balance_dirty_start(&wbc, bdi);
> - if (bdi_nr_reclaimable > bdi_thresh) {
> + if ((bdi_nr_reclaimable > bdi_thresh) ||
> + (dirty_info_reclaimable(&memcg_info) >
> + memcg_info.dirty_thresh)) {
Why does memcg need this check?
As I understand it, bdi_thresh is already a function of dirty_thresh, and the
dirty_thresh passed in is already the memcg's one (via the min() above).
So isn't that enough?
I don't know this logic well, but judging from the comment you are at least
changing the behavior here.
Please explain why you need it, and if you do need it, please update the
comment as well.
"
* Only move pages to writeback if this bdi is over its
* threshold otherwise wait until the disk writes catch
* up.
"
> writeback_inodes_wb(&bdi->wb, &wbc);
> pages_written += write_chunk - wbc.nr_to_write;
> trace_wbc_balance_dirty_written(&wbc, bdi);
> @@ -594,8 +634,10 @@ static void balance_dirty_pages(struct address_space *mapping,
> * background_thresh, to keep the amount of dirty memory low.
> */
> if ((laptop_mode && pages_written) ||
> - (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
> - sys_info.background_thresh)))
> + (!laptop_mode && ((dirty_info_reclaimable(&sys_info) >
> + sys_info.background_thresh) ||
> + (dirty_info_reclaimable(&memcg_info) >
> + memcg_info.background_thresh))))
> bdi_start_background_writeback(bdi);
> }
>
> @@ -653,12 +695,20 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> +/*
> + * Throttle the current task if it is near dirty memory usage limits.
> + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
> + * information; otherwise use the per-memcg dirty limits.
> + */
> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
> {
> struct dirty_info sys_info;
> + struct dirty_info memcg_info;
>
> for ( ; ; ) {
> global_dirty_info(&sys_info);
> + if (!memcg_dirty_info(mem_cgroup, &memcg_info))
> + memcg_info = sys_info;
>
> /*
> * Boost the allowable dirty threshold a bit for page
> @@ -666,9 +716,13 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> */
> sys_info.dirty_thresh +=
> sys_info.dirty_thresh / 10; /* wheeee... */
> + memcg_info.dirty_thresh +=
> + memcg_info.dirty_thresh / 10; /* wheeee... */
>
> - if (sys_info.nr_unstable_nfs +
> - sys_info.nr_writeback <= sys_info.dirty_thresh)
> + if ((sys_info.nr_unstable_nfs +
> + sys_info.nr_writeback <= sys_info.dirty_thresh) &&
> + (memcg_info.nr_unstable_nfs +
> + memcg_info.nr_writeback <= memcg_info.dirty_thresh))
> break;
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ba11e28..f723242 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1927,7 +1927,7 @@ restart:
> sc->nr_scanned - nr_scanned, sc))
> goto restart;
>
> - throttle_vm_writeout(sc->gfp_mask);
> + throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
> }
>
> /*
> --
> 1.7.3.1
>
--
Kind regards,
Minchan Kim