Message-ID: <20150207143839.GA9926@htj.dyndns.org>
Date: Sat, 7 Feb 2015 09:38:39 -0500
From: Tejun Heo <tj@...nel.org>
To: Greg Thelen <gthelen@...gle.com>
Cc: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...e.cz>,
Cgroups <cgroups@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>,
Jens Axboe <axboe@...nel.dk>,
Christoph Hellwig <hch@...radead.org>,
Li Zefan <lizefan@...wei.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well. Shared readers seem like a more valuable use case
> (thin provisioning). I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage. I haven't wrapped my head
> around how these shared data pages will get protected. It seems like
> they'd no longer be protected by child min watermarks.
Yes, this is challenging. My current thought is to take the maximum
of the low settings of the sharing children, but I need to think more
about it. One problem is that the shared inodes will preemptively
take away the amount shared from the children's low protection. They
won't compete fairly with other inodes or anons, but then they can't
really, as they don't really belong to any single sharer.
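
To make that concrete, roughly something like the following (purely
illustrative pseudo-code, not existing kernel interfaces; the struct
and helper names here are made up):

	/*
	 * Hypothetical sketch: treat the effective low protection of a
	 * shared inode as the maximum of the low settings of the memcgs
	 * sharing it.  shared_inode and sharer_low are made-up names.
	 */
	struct shared_inode {
		unsigned long *sharer_low;	/* low settings of sharers */
		int nr_sharers;
	};

	static unsigned long shared_inode_effective_low(const struct shared_inode *si)
	{
		unsigned long low = 0;
		int i;

		for (i = 0; i < si->nr_sharers; i++)
			if (si->sharer_low[i] > low)
				low = si->sharer_low[i];
		return low;
	}
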
> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture. Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage." But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per page controller.
> (2) maintain a per inode i_memcg which is set to the common dirtying
> ancestor. If not shared then it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree
> writing dirty inodes, this will write shared inodes using blkcg
> priority of the respective levels.
> (4) background limit wb_check_background_flush() and time based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to non-shared state.
> For non-shared inodes, this should behave the same. For shared inodes
> it should only affect those in the hierarchy which is sharing.
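
Just to make sure I'm reading (2) correctly, something along these
lines (illustrative only; the i_memcg field and the
memcg_common_ancestor() helper are hypothetical, not existing
interfaces)?

	/*
	 * Hypothetical sketch of (2): on each dirtying, move the inode's
	 * i_memcg to the nearest common ancestor of its current owner
	 * and the dirtying task's memcg, so a non-shared inode keeps
	 * pointing at the memcg its pages were charged to.
	 */
	static void inode_update_memcg(struct inode *inode,
				       struct mem_cgroup *dirtier)
	{
		struct mem_cgroup *cur = inode->i_memcg;	/* hypothetical */

		if (!cur)
			inode->i_memcg = dirtier;	/* first dirtier owns it */
		else if (cur != dirtier)
			inode->i_memcg = memcg_common_ancestor(cur, dirtier);
	}
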
The thing which breaks when you decouple what memcg sees from the
rest of the stack is that the amount of memory available to a given
cgroup, and how much of that is dirty, is the main linkage
propagating IO pressure to the actual dirtying tasks. If you decouple
the two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner, which is why anything inside a memcg
currently ends up in the direct reclaim path instead of being
properly dirty throttled.
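
Schematically, the linkage is something like this (purely
illustrative, not the actual writeback code; the names are made up):

	/*
	 * Hypothetical sketch: to throttle a dirtier, writeback needs a
	 * consistent view of how much memory is available to the cgroup
	 * and how much of it is dirty.  If memcg and writeback disagree
	 * on these numbers, the throttle point can't be computed and the
	 * fallback is hitting direct reclaim.
	 */
	struct memcg_dirty_view {
		unsigned long available;	/* memory available to the cgroup */
		unsigned long dirty;		/* dirty pages as memcg sees them */
	};

	static bool should_throttle_dirtier(const struct memcg_dirty_view *v,
					    unsigned int dirty_ratio)
	{
		/* throttle once dirty exceeds dirty_ratio% of available */
		return v->dirty * 100 > v->available * dirty_ratio;
	}
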
You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug, and that
*might* be true, but I have a nagging feeling that such a decision
would be made merely out of immediate convenience. I would much
prefer having a well-defined model of sharing inodes and anons across
cgroups, so that the behaviors shown in those cases aren't mere
accidental consequences without any innate meaning.
If we can argue that memcg and blkcg having different views is
meaningful, and can characterize and justify the behaviors stemming
from the deviation, sure, that'd be fine, but I don't think we have
that as of now.
Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/