linux-kernel - Re: cgroup: rmdir() does not complete

Message-Id: <20100910131144.0904d754.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Fri, 10 Sep 2010 13:11:44 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Daisuke Nishimura <nishimura@....nes.nec.co.jp>
Cc:	Mark Hills <mark@...o.org.uk>,
	Peter Zijlstra <peterz@...radead.org>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	linux-kernel@...r.kernel.org
Subject: Re: cgroup: rmdir() does not complete

On Fri, 10 Sep 2010 13:05:39 +0900
Daisuke Nishimura <nishimura@....nes.nec.co.jp> wrote:

> On Fri, 10 Sep 2010 11:16:46 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> 
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@...o.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from 
> > > mem_cgroup_force_empty.
> > > 
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > > don't think this is as important as what causes the spin.
> > > 
> > 
> > I noticed you use FUSE, and there seems to be a problem between FUSE and memcg.
> > I wrote a patch (onto 2.6.36 but can be applied..)
> > 
> Nice catch!
> 
> > Could you try this? I'm sorry, I don't use FUSE and can't test it
> > right now.
> > 
> Sorry, I can't either.
> 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> > 
> > The memory cgroup catches every page that is added to the radix-tree and
> > assumes the page will be added to an LRU somewhere.
> > But some pages are on the radix-tree and not on any LRU. force_empty
> > cannot find them, so ->pre_destroy() cannot finish and rmdir() never
> > completes.
> > 
> > This patch adds __GFP_NOMEMCGROUP so that such unnecessary, out-of-control
> > pages are not registered to the memory cgroup.
> > 
> > Note: This gfp flag can be used for shmem handling, which now uses
> >       complicated heuristics.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> > ---
> >  fs/fuse/dev.c       |   11 ++++++++++-
> >  include/linux/gfp.h |    7 +++++++
> >  mm/memcontrol.c     |    2 +-
> >  3 files changed, 18 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.36-rc3/fs/fuse/dev.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/fs/fuse/dev.c
> > +++ linux-2.6.36-rc3/fs/fuse/dev.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/pipe_fs_i.h>
> >  #include <linux/swap.h>
> >  #include <linux/splice.h>
> > +#include <linux/memcontrol.h>
> >  
> >  MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> >  MODULE_ALIAS("devname:fuse");
> > @@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
> >  	struct pipe_buffer *buf = cs->pipebufs;
> >  	struct address_space *mapping;
> >  	pgoff_t index;
> > +	gfp_t mask = GFP_KERNEL;
> >  
> >  	unlock_request(cs->fc, cs->req);
> >  	fuse_copy_finish(cs);
> > @@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
> >  	remove_from_page_cache(oldpage);
> >  	page_cache_release(oldpage);
> >  
> > -	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
> > +	/*
> > +	 * not-on-LRU pages are out of control. So, add to root cgroup.
> > +	 * See mm/memcontrol.c for details.
> > +	 */
> > +	if (buf->flags & PIPE_BUF_FLAG_LRU)
> > +		mask |= __GFP_NOMEMCGROUP;
> > +
> > +	err = add_to_page_cache_locked(newpage, mapping, index, mask);
> >  	if (err) {
> >  		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
> >  		goto out_fallback_unlock;
> > Index: linux-2.6.36-rc3/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/include/linux/gfp.h
> > +++ linux-2.6.36-rc3/include/linux/gfp.h
> > @@ -60,6 +60,13 @@ struct vm_area_struct;
> >  #define __GFP_NOTRACK	((__force gfp_t)0)
> >  #endif
> >  
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
> > +	/* Don't track by memory cgroup */
> > +#else
> > +#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
> > +#endif
> > +
> >  /*
> >   * This may seem redundant, but it's a way of annotating false positives vs.
> >   * allocations that simply cannot be supported (e.g. page tables).
> > Index: linux-2.6.36-rc3/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/mm/memcontrol.c
> > +++ linux-2.6.36-rc3/mm/memcontrol.c
> > @@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
> >  
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> > -	if (PageCompound(page))
> > +	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
> >  		return 0;
> >  	/*
> >  	 * Corner case handling. This is called from add_to_page_cache()
> > 
> The comment above says "not-on-LRU pages are out of control. So, add to root cgroup.".
> But this change means that we don't charge these pages at all.
> 
> Should it be:
> 
> 	if (gfp_mask & __GFP_NOMEMCGROUP)
> 		mm = &init_mm;
> 
> ?
> Or, change the comment ?
> 

Yes, the comment is wrong. Here is the patch again with the comment fixed.
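
To make the difference between the two readings concrete, here is a minimal
user-space sketch (not kernel code). It assumes a toy hierarchy with a root
group and one task group; every name in it (model_memcg, model_hierarchy,
POLICY_SKIP_CHARGE, POLICY_CHARGE_ROOT, model_charge_flagged_page) is
hypothetical and exists only for illustration. POLICY_SKIP_CHARGE models what
the patch does (return before charging), and POLICY_CHARGE_ROOT models the
"mm = &init_mm" alternative suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; these are not kernel structures. */
enum nomemcg_policy {
	POLICY_SKIP_CHARGE,	/* what the patch does: no accounting at all */
	POLICY_CHARGE_ROOT	/* the suggested alternative: account to root */
};

struct model_memcg {
	unsigned long usage;	/* number of pages charged to this group */
};

struct model_hierarchy {
	struct model_memcg root;
	struct model_memcg task_group;
};

/*
 * Handle a page whose allocation mask carries the "no memcg" flag.
 * Returns the group that was charged, or NULL if nothing was charged.
 * In both policies the task's own group is left untouched.
 */
static struct model_memcg *
model_charge_flagged_page(struct model_hierarchy *h, enum nomemcg_policy policy)
{
	assert(h != NULL);

	if (policy == POLICY_SKIP_CHARGE)
		return NULL;

	h->root.usage++;
	return &h->root;
}
```

Either way the task's group never sees the page, so its rmdir() is not blocked
by it. The only difference in this model is whether the root group's usage
counter moves.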

Thanks,
-Kame
==

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>

The memory cgroup catches every page that is added to the radix-tree and
assumes the page will be added to an LRU somewhere.
But some pages are on the radix-tree and not on any LRU. force_empty
cannot find them, so ->pre_destroy() cannot finish and rmdir() never
completes.

This patch adds __GFP_NOMEMCGROUP so that such unnecessary, out-of-control
pages are not registered to the memory cgroup.

Note: This gfp flag can be used for shmem handling, which now uses
      complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * non-LRU pages are out of cgroup controls.
+	 * See mm/memcontrol.c or Documentation/cgroup/memory.txt for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
 		goto out_fallback_unlock;
Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
+	/* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
+#endif
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
+	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
 		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()
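
For anyone who wants to see the failure mode without a FUSE setup, the
following is a small user-space model of the problem described in the
changelog. It is only a sketch under simplifying assumptions: a "page" is two
booleans, force_empty is a bounded loop instead of an endless retry, and all
names (model_page, model_cgroup, GFP_KERNEL_MODEL, GFP_NOMEMCGROUP_MODEL,
model_cache_charge, model_force_empty, model_rmdir_completes) are hypothetical
stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
typedef unsigned int gfp_t;
#define GFP_KERNEL_MODEL	0x01u
#define GFP_NOMEMCGROUP_MODEL	0x02u	/* stands in for __GFP_NOMEMCGROUP */

struct model_page {
	bool charged;	/* accounted to the cgroup */
	bool on_lru;	/* reachable by force_empty */
};

struct model_cgroup {
	unsigned long usage;	/* number of charged pages */
};

/* Mirrors the check added to mem_cgroup_cache_charge(): skip when flagged. */
static int model_cache_charge(struct model_cgroup *cg, struct model_page *pg,
			      gfp_t mask)
{
	if (mask & GFP_NOMEMCGROUP_MODEL)
		return 0;
	pg->charged = true;
	cg->usage++;
	return 0;
}

/*
 * force_empty can only uncharge pages it finds on the LRU.
 * Returns true if usage dropped to zero within max_passes.
 */
static bool model_force_empty(struct model_cgroup *cg, struct model_page *pages,
			      size_t n, int max_passes)
{
	for (int pass = 0; pass < max_passes && cg->usage > 0; pass++) {
		for (size_t i = 0; i < n; i++) {
			if (pages[i].charged && pages[i].on_lru) {
				assert(cg->usage > 0);
				pages[i].charged = false;
				cg->usage--;
			}
		}
	}
	return cg->usage == 0;
}

/* Charge one LRU page and one non-LRU page, then try to empty the group. */
static bool model_rmdir_completes(gfp_t non_lru_mask)
{
	struct model_cgroup cg = { 0 };
	struct model_page pages[2] = {
		{ .charged = false, .on_lru = true },
		{ .charged = false, .on_lru = false },
	};

	model_cache_charge(&cg, &pages[0], GFP_KERNEL_MODEL);
	model_cache_charge(&cg, &pages[1], non_lru_mask);
	return model_force_empty(&cg, pages, 2, 8);
}
```

In this model, model_rmdir_completes(GFP_KERNEL_MODEL) leaves one charge that
no pass can find, which corresponds to the spinning rmdir(). Passing
GFP_KERNEL_MODEL | GFP_NOMEMCGROUP_MODEL for the non-LRU page keeps it out of
the accounting, and the group empties on the first pass.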

