linux-kernel - Re: [RFC] Making memcg track ownership per address_space or anon

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Fri, 30 Jan 2015 01:27:37 -0500
From:	Tejun Heo <tj@...nel.org>
To:	Greg Thelen <gthelen@...gle.com>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>, cgroups@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>,
	Jens Axboe <axboe@...nel.dk>,
	Christoph Hellwig <hch@...radead.org>,
	Li Zefan <lizefan@...wei.com>, hughd@...gle.com,
	Konstantin Khebnikov <khlebnikov@...dex-team.ru>
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Thu, Jan 29, 2015 at 09:55:53PM -0800, Greg Thelen wrote:
> I find simplification appealing.  But I not sure it will fly, if for no
> other reason than the shared accountings.  I'm ignoring intentional
> sharing, used by carefully crafted apps, and just thinking about
> incidental sharing (e.g. libc).
> 
> Example:
> 
> $ mkdir small
> $ echo 1M > small/memory.limit_in_bytes
> $ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &
> 
> $ mkdir big
> $ echo 10G > big/memory.limit_in_bytes
> $ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &
> 
> Assuming big/mlockall_database mlocks all of libc, then it will oom kill
> the small memcg because libc is owned by small due it having touched it
> first.  It'd be hard to figure out what small did wrong to deserve the
> oom kill.

The previous behavior was pretty unpredictable in terms of shared file
ownership too.  I wonder whether the better thing to do here is either
charging cases like this to the common ancestor or splitting the
charge equally among the accessors, which might be doable for ro
files.

> FWIW we've been using memcg writeback where inodes have a memcg
> writeback owner.  Once multiple memcg write to an inode then the inode
> becomes writeback shared which makes it more likely to be written.  Once
> cleaned the inode is then again able to be privately owned:
> https://lkml.org/lkml/2011/8/17/200

The problem is that it introduces deviations between memcg and
writeback / blkcg which will mess up pressure propagation.  Writeback
pressure can't be determined without its associated memcg and neither
can dirty balancing.  We sure can simplify things by trading off
accuracies at places but let's please try to do that throughout the
stack, not in the midpoint, so that we can say "if you do this, it'll
behave this way and you can see it showing up there".  The thing is if
we leave it half-way, in time, some will try to actively exploit
memcg's page granularity and we'll have to deal with writeback
behavior which is difficult to even characterize.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/