linux-kernel - Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20151106163555.GB7813@cmpxchg.org>
Date:	Fri, 6 Nov 2015 11:35:55 -0500
From:	Johannes Weiner <hannes@...xchg.org>
To:	Vladimir Davydov <vdavydov@...tuozzo.com>
Cc:	Michal Hocko <mhocko@...nel.org>,
	David Miller <davem@...emloft.net>, akpm@...ux-foundation.org,
	tj@...nel.org, netdev@...r.kernel.org, linux-mm@...ck.org,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified
 hierarchy

On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> ...
> > > 3) keep only some (safe) cache types enabled by default with the current
> > >    failing semantic and require an explicit enabling for the complete
> > >    kmem accounting. [di]cache code paths should be quite robust to
> > >    handle allocation failures.
> > 
> > Vladimir, what would be your opinion on this?
> 
> I'm all for this option. Actually, I've been thinking about this since I
> introduced the __GFP_NOACCOUNT flag. Not because of the failing
> semantics, since we can always let kmem allocations breach the limit.
> This shouldn't be critical, because I don't think it's possible to issue
> a series of kmem allocations w/o a single user page allocation, which
> would reclaim/kill the excess.
> 
> The point is there are allocations that are shared system-wide and
> therefore shouldn't go to any memcg. Most obvious examples are: mempool
> users and radix_tree/idr preloads. Accounting them to memcg is likely to
> result in noticeable memory overhead as memory cgroups are
> created/destroyed, because they pin dead memory cgroups with all their
> kmem caches, which aren't tiny.
> 
> Another funny example is objects destroyed lazily for performance
> reasons, e.g. vmap_area. Such objects are usually very small, so
> delaying destruction of a bunch of them will normally go unnoticed.
> However, if kmemcg is used the effective memory consumption caused by
> such objects can be multiplied by many times due to dangling kmem
> caches.
> 
> We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
> problem is they are tricky to identify, because they are scattered all
> over the kernel source tree. E.g. Dave Chinner mentioned that XFS
> internals do a lot of allocations that are shared among all XFS
> filesystems and therefore should not be accounted (BTW that's why
> list_lru's used by XFS are not marked as memcg-aware). There must be
> more out there. Besides, kernel developers don't usually even know about
> kmemcg (they just write the code for their subsys, so why should they?)
> so they won't care thinking about using __GFP_NOACCOUNT, and hence new
> falsely-accounted allocations are likely to appear.
> 
> That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
> (__GFP_ACCOUNT) kmem accounting policy we would make the system more
> predictable and robust IMO. OTOH what would we lose? Security? Well,
> containers aren't secure IMHO. In fact, I doubt they will ever be (as
> secure as VMs). Anyway, if a runaway allocation is reported, it should
> be trivial to fix by adding __GFP_ACCOUNT where appropriate.

I wholeheartedly agree with all of this.

> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space. This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessarily.

Awesome, I'm looking forward to that patch!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/