linux-kernel - Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20110315105612.f600a659.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Tue, 15 Mar 2011 10:56:12 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Greg Thelen <gthelen@...gle.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	containers@...ts.osdl.org, linux-fsdevel@...r.kernel.org,
	Andrea Righi <arighi@...eler.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
	Minchan Kim <minchan.kim@...il.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Ciju Rajan K <ciju@...ux.vnet.ibm.com>,
	David Rientjes <rientjes@...gle.com>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Chad Talbott <ctalbott@...gle.com>,
	Justin TerAvest <teravest@...gle.com>,
	Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, 14 Mar 2011 11:29:17 -0700
Greg Thelen <gthelen@...gle.com> wrote:

> On Fri, Mar 11, 2011 at 5:10 PM, Andrew Morton

> My rational for pursuing bdi writeback was I/O locality.  I have heard that
> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> locality.
> 
> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> that needed writeback.  I think the part a) memcg filtering is likely something
> like:
>  http://marc.info/?l=linux-kernel&m=129910424431837
> 
> The part b) bdi selection should not be too hard assuming that page-to-mapping
> locking is doable.
> 

For now, I like b). 

> An alternative approach is to bind each inode to exactly one cgroup (possibly
> the root cgroup).  Both the cache page allocations and dirtying charges would be
> accounted to the i_cgroup.  With this approach there is no foreign dirtier issue
> because all pages are in a single cgroup.  I find this undesirable because the
> first memcg to touch an inode is charged for all pages later cached even by
> other memcg.
> 

I don't think 'foreign dirtier' is a big problem. When program does write(),
the file to be written is tend to be under control of the application in
the cgroup. I don't think 'written file' is shared between several cgroups, 
typically. But /var/log/messages is a shared one ;)

But I know some other OSs has 'group for file cache'. I'll not nack if you
propose such patch. Maybe there are some guys who want to limit the amount of
file cache.



> When a page is allocated it is charged to the current task's memcg.  When a
> memcg page is later marked dirty the dirty charge is billed to the memcg from
> the original page allocation.  The billed memcg may be different than the
> dirtying task's memcg.
> 
yes.

> After a rate limited number of file backed pages have been dirtied,
> balance_dirty_pages() is called to enforce dirty limits by a) throttling
> production of more dirty pages by current and b) queuing background writeback to
> the current bdi.
> 
> balance_dirty_pages() receives a mapping and page count, which indicate what may
> have been dirtied and the max number of pages that may have been dirtied.  Due
> to per cpu rate limiting and batching (when nr_pages_dirtied > 0),
> balance_dirty_pages() does not know which memcg were charged for recently dirty
> pages.
> 
> I think both bdi and system limits have the same issue in that a bdi may be
> pushed over its dirty limit but not immediately checked due to rate limits.  If
> future dirtied pages are backed by different bdi, then future
> balance_dirty_page() calls will check the second, compliant bdi ignoring the
> first, over-limit bdi.  The safety net is that the system wide limits are also
> checked in balance_dirty_pages.  However, per bdi writeback is employed in this
> situation.
> 
> Note: This memcg foreign dirtier issue does not make it any more likely that a
> memcg is pushed above its usage limit (limit_in_bytes).  The only limit with
> this weak contract is the dirty limit.
> 
> For reference, this issue was touch on in
> http://marc.info/?l=linux-mm&m=128840780125261
> 
> There are ways to handle this issue (my preferred option is option #1).
> 
> 1) keep a (global?) foreign_dirtied_memcg list of memcg that were recently
>   charged for dirty pages by tasks outside of memcg.  When a memcg dirty page
>   count is elevated, the page's memcg would be queued to the list if current's
>   memcg does not match the pages cgroup.  mem_cgroup_balance_dirty_pages()
>   would balance the current memcg and each memcg it dequeues from this list.
>   This should be a straightforward fix.
> 

Can you implement this in an efficient way ? (without taking any locks ?)
It seems cost > benefit.



> 2) When pages are dirtied, migrate them to the current task's memcg.
>   mem_cgroup_balance_dirty_pages would then have a better chance at seeing all
>   pages dirtied by the current operation.  This is still not perfect solution
>   due to rate limiting.  This also is bad because such a migration would
>   involve charging and possibly memcg direct reclaim because the destination
>   memcg may be at its memory usage limit.  Doing all of this in
>   account_page_dirtied() seems like trouble, so I do not like this approach.
> 

I think this cannot be implemented in an efficnent way.



> 3) Pass in some context which is represents a set of pages recently dirtied into
>   [mem_cgroup]_balance_dirty_pages.  What would be a good context to collect
>   the set of memcg that should be balanced?
>   - an extra passed in parameter - yuck.
>   - address_space extension - does not feel quite right because address space
>     is not a io context object, I presume it can be shared by concurrent
>     threads.
>   - something hanging on current.  Are there cases where pages become dirty
>     that are not followed by a call to balance dirty pages Note: this option
>     (3) is not a good idea because rate limiting make dirty limit enforcement
>     an inexact science.  There is no guarantee that a caller will have context
>     describing the pages (or bdis) recently dirtied.
> 

I'd like to have an option  'cgroup only for file cache' rather than adding more
hooks and complicated operations.

But, if we need to record 'who dirtied ?' information, record it in page_cgroup
or radix-tree and do filtering is what I can consider, now.
In this case, some tracking information will be required to be stored into
struct inode, too.


How about this ?

 1. record 'who dirtied memcg' into page_cgroup or radix-tree.
   I prefer recording in radix-tree rather than using more field in page_cgroup.
 2. bdi-writeback does some extra queueing operation per memcg.
   find a page, check 'who dirtied', enqueue it(using page_cgroup or list of pagevec)
 3. writeback it's own queue.(This can be done before 2. if cgroup has queue, already)
 4. Some GC may be required...

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/