Message-Id: <20110323180628.09cd770e.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Wed, 23 Mar 2011 18:06:28 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Greg Thelen <gthelen@...gle.com>
Cc:	Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
	Vivek Goyal <vgoyal@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	containers@...ts.osdl.org, linux-fsdevel@...r.kernel.org,
	Andrea Righi <arighi@...eler.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
	Minchan Kim <minchan.kim@...il.com>,
	Ciju Rajan K <ciju@...ux.vnet.ibm.com>,
	David Rientjes <rientjes@...gle.com>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Chad Talbott <ctalbott@...gle.com>,
	Justin TerAvest <teravest@...gle.com>,
	Curt Wohlgemuth <curtw@...gle.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, 18 Mar 2011 00:57:09 -0700
Greg Thelen <gthelen@...gle.com> wrote:

> On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <hannes@...xchg.org> wrote:
> > On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> >> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback',
> >> Jan and Vivek have had some discussion around how memcg and writeback
> >> mesh.  In my mind, the discussions in 8/9 are starting to blend with
> >> this thread.
> >>
> >> I have been thinking about Johannes' struct memcg_mapping.  I think the
> >> idea may address several of the issues being discussed, especially the
> >> interaction between IO-less balance_dirty_pages() and memcg writeback.
> >>
> >> Here is my thinking.  Feedback is most welcome!
> >>
> >> The data structures:
> >> - struct memcg_mapping {
> >>        struct address_space *mapping;
> >>        struct mem_cgroup *memcg;
> >>        int refcnt;
> >>   };
> >> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> >> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be a
> >>   very large set representing many cached inodes.
> >> - each memcg_mapping represents all pages within a bdi,inode,memcg tuple.
> >>   All corresponding cached inode pages point to the same memcg_mapping via
> >>   pc->mapping.  I assume that all pages of an inode belong to no more than
> >>   one bdi.
> >> - manage a global list of memcg that are over their respective background dirty
> >>   limit.
> >> - i_mapping continues to point to a traditional non-memcg mapping (no change
> >>   here).
> >> - none of these memcg_* structures affect root cgroup or kernels with memcg
> >>   configured out.
> >
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> >        ...
> >        /* key is struct backing_dev_info * */
> >        struct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> >        /* key is struct address_space * */
> >        struct rb_root memcg_mappings;
> >        struct rb_node node;
> > };
> >
> > struct memcg_mapping {
> >        struct address_space *mapping;
> >        struct mem_cgroup *memcg;
> >        struct rb_node node;
> >        atomic_t count;
> > };
> >
> > struct page_cgroup {
> >        ...
> >        struct memcg_mapping *memcg_mapping;
> > };
> >
> >> The routines under discussion:
> >> - memcg charging a new inode page to a memcg: will use inode->mapping and
> >>   inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating
> >>   missing levels in the data structure.
> >>
> >> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
> >>   locate the memcg.  If the refcnt drops to zero, then remove the
> >>   memcg_mapping from the memcg_bdi.  Also delete the memcg_bdi if the
> >>   last memcg_mapping is removed.
> >>
> >> - account_page_dirtied(): nothing new here, continue to set the per-page flags
> >>   and increment the memcg per-cpu dirty page counter.  Same goes for routines
> >>   that mark pages in writeback and clean states.
> >
> > We may want to remember the dirty memcg_mappings so that on writeback
> > we don't have to go through every single one that the memcg refers to?
> 
> I think allowing a per-memcg, per-bdi list of dirty mappings is a good idea.
> 
> It feels like some of this is starting to gel.  I've been sketching
> some of the code to see how the memcg locking will work out.  The
> basic structures I see are:
> 
> struct mem_cgroup {
>         ...
>         /*
>          * For all file pages cached by this memcg, sorted by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock;   /* or an RCU structure; the memcg:bdi
>                                  * set could be fairly static */
> };
> 
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct memcg_mapping *
>          * memcg_mappings live within either the mappings or the
>          * dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock; /* protect [dirty_]mappings */
> };
> 
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
> 
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
> 
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
>   mappings: from address_space to memcg_mapping for clean pages
>   dirty_mappings: from address_space to memcg_mapping when there are
>   some dirty pages
> - each memcg_mapping represents a set of cached pages within a
>   bdi,inode,memcg tuple.  All corresponding cached inode pages point to
>   the same memcg_mapping via pc->memcg_mapping.  I assume that all pages
>   of an inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
>  limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
>  here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
>  configured out.
> 
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and
>   inode to walk memcg -> memcg_bdi -> mappings, lazily allocating missing
>   levels in the data structure.
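> 
> Very roughly, that charge-time walk might look like the following (an
> untested sketch: every helper named here is invented, and locking vs.
> allocation details are glossed over):
> 
> static struct memcg_mapping *memcg_mapping_get(struct mem_cgroup *memcg,
>                                         struct address_space *mapping)
> {
>         struct backing_dev_info *bdi = mapping->backing_dev_info;
>         struct memcg_bdi *mb;
>         struct memcg_mapping *mm;
> 
>         spin_lock(&memcg->bdis_lock);
>         mb = memcg_bdi_lookup(memcg, bdi);      /* rb walk of memcg->bdis */
>         if (!mb)
>                 mb = memcg_bdi_create(memcg, bdi);      /* lazy alloc */
>         spin_unlock(&memcg->bdis_lock);
> 
>         spin_lock(&mb->lock);
>         /* a real version would also search mb->dirty_mappings */
>         mm = memcg_mapping_lookup(&mb->mappings, mapping);
>         if (!mm)
>                 mm = memcg_mapping_create(mb, mapping); /* lazy alloc */
>         atomic_inc(&mm->nr_pages);
>         spin_unlock(&mb->lock);
> 
>         return mm;      /* caller stores this in pc->memcg_mapping */
> }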
> 
> - Uncharging an inode page from a memcg: will use pc->memcg_mapping to
>   locate the memcg_bdi.  If the refcnt drops to zero, then remove the
>   memcg_mapping from memcg_bdi.[dirty_]mappings.  Also delete the
>   memcg_bdi if its last memcg_mapping is removed.
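> 
> And the uncharge side, again just a sketch with invented helpers (the
> race between the refcount hitting zero and the tree removal needs more
> care than shown):
> 
> static void memcg_mapping_put_page(struct page_cgroup *pc)
> {
>         struct memcg_mapping *mm = pc->memcg_mapping;
>         struct memcg_bdi *mb = mm->memcg_bdi;
> 
>         pc->memcg_mapping = NULL;
>         if (!atomic_dec_and_test(&mm->nr_pages))
>                 return;
> 
>         /* last page of this (memcg, bdi, inode) is gone: unlink + free */
>         spin_lock(&mb->lock);
>         rb_erase(&mm->node, &mb->mappings);     /* or &mb->dirty_mappings */
>         spin_unlock(&mb->lock);
>         kfree(mm);
>         /* freeing an emptied memcg_bdi would hook in here as well */
> }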
> 
> - account_page_dirtied(): increment nr_dirty.  If this is the first dirty
>   page, then move the memcg_mapping from memcg_bdi.mappings to
>   memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite.
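> 
> The clean->dirty transition is then a tree-to-tree move, something like
> this (sketch; memcg_mapping_insert() is a made-up rb-tree insert helper,
> and the race with the last-page-clean path is ignored):
> 
> static void memcg_mapping_inc_dirty(struct memcg_mapping *mm)
> {
>         struct memcg_bdi *mb = mm->memcg_bdi;
> 
>         if (atomic_inc_return(&mm->nr_dirty) == 1) {
>                 /* first dirty page: mappings -> dirty_mappings */
>                 spin_lock(&mb->lock);
>                 rb_erase(&mm->node, &mb->mappings);
>                 memcg_mapping_insert(&mb->dirty_mappings, mm);
>                 spin_unlock(&mb->lock);
>         }
> }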
> 
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
>   background limit, then add the memcg to the global memcg_over_bg_limit
>   list and use the memcg's set of memcg_bdi to wake up each(?) corresponding
>   bdi flusher.  If over the fg limit, then use IO-less style foreground
>   throttling with the per-memcg per-bdi (aka memcg_bdi) accounting
>   structure.
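> 
> In pseudo-C, something on this order (sketch only: the memcg_*
> threshold/list helpers and for_each_memcg_bdi() are invented;
> bdi_start_background_writeback() is the existing flusher kick):
> 
> void mem_cgroup_balance_dirty_pages(struct address_space *mapping)
> {
>         /* simplified: real code needs rcu/css reference handling */
>         struct mem_cgroup *memcg = mem_cgroup_from_task(current);
>         struct memcg_bdi *mb;
> 
>         if (memcg_dirty_pages(memcg) > memcg_bg_thresh(memcg)) {
>                 memcg_over_bg_list_add(memcg);
>                 for_each_memcg_bdi(memcg, mb)   /* walk memcg->bdis */
>                         bdi_start_background_writeback(mb->bdi);
>         }
>         if (memcg_dirty_pages(memcg) > memcg_fg_thresh(memcg))
>                 /* IO-less throttle against per-memcg per-bdi progress */
>                 memcg_wait_for_bdi_writeback(memcg,
>                                              mapping->backing_dev_info);
> }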
> 
> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
>   fs-writeback.c so that wb_do_writeback() returns to calling
>   wb_check_background_flush() to check background limits, remaining
>   interruptible if a sync flush occurs.  wb_check_background_flush() will
>   check the global memcg_over_bg_limit list for memcgs that are over their
>   dirty limit.  Within each memcg, write inodes from the dirty_mappings
>   list until a threshold page count has been reached (MAX_WRITEBACK_PAGES),
>   then move to the next listed memcg.
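> 
> So wb_check_background_flush() might grow a loop roughly like this
> (names invented):
> 
> static long wb_memcg_background_flush(struct bdi_writeback *wb)
> {
>         long written = 0;
>         struct mem_cgroup *memcg;
> 
>         for_each_memcg_over_bg_limit(memcg) {
>                 struct memcg_bdi *mb = memcg_bdi_lookup(memcg, wb->bdi);
> 
>                 if (!mb)
>                         continue;       /* no pages for this bdi */
>                 /* write inodes from mb->dirty_mappings, one chunk each */
>                 written += memcg_writeback_inodes(wb, mb,
>                                                   MAX_WRITEBACK_PAGES);
>         }
>         return written;
> }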
> 
> - over_bground_thresh() will determine if a memcg is still over its bg
>   limit.  If over the limit, then per-bdi per-memcg background flushing
>   will continue.  If not over the limit, the memcg will be removed from
>   the memcg_over_bg_limit list.
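> 
> i.e. something like (same invented helpers as above):
> 
> static bool memcg_over_bground_thresh(struct mem_cgroup *memcg)
> {
>         if (memcg_dirty_pages(memcg) > memcg_bg_thresh(memcg))
>                 return true;    /* keep flushing this memcg */
>         memcg_over_bg_list_del(memcg);
>         return false;
> }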
> 
> I'll post my resulting patches in RFC form, or (at the least) my conclusions.
> 
Please take care of force_empty and move_mapping et al. when you do this,
and please do rmdir() tests.


Thanks,
-Kame
