Message-ID: <20150903093608.GA2346@esperanza>
Date:	Thu, 3 Sep 2015 12:36:08 +0300
From:	Vladimir Davydov <vdavydov@...allels.com>
To:	Christoph Lameter <cl@...ux.com>
CC:	Michal Hocko <mhocko@...nel.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Pekka Enberg <penberg@...nel.org>,
	David Rientjes <rientjes@...gle.com>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Tejun Heo <tj@...nel.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is
 enabled

On Wed, Sep 02, 2015 at 01:16:47PM -0500, Christoph Lameter wrote:
> On Wed, 2 Sep 2015, Vladimir Davydov wrote:
> 
> > Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> > with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> > alloc_pages with the caller's context, it does the job normally done by
> > alloc_pages itself. That is not something done by many alloc_pages users.
> >
> > Leaving the slab charge path as is looks really ugly to me. Look, slab
> > iterates over all nodes, checking whether they have free pages, and
> > fails even if they do, due to the memcg constraint...
> 
> Well, yes, it needs to do that because of the way NUMA support was
> designed in. SLAB needs to check the per-node caches for present objects
> before going to more remote nodes. Sorry about this. I realized the
> design issue in 2006, and SLUB was the result in 2007 of an alternate
> design that lets the page allocator do its proper job.

Yeah, SLUB is OK in this respect.
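
For anyone following along, the SLAB fallback you describe looks roughly
like this. This is only a simplified sketch, not the real fallback_alloc()
from mm/slab.c (which walks the zonelist rather than all online nodes);
the function name is made up:

#include <linux/gfp.h>
#include <linux/nodemask.h>

/*
 * Sketch: probe each node locally and without reclaim first, then hand
 * the whole job back to the page allocator with the caller's context.
 */
static struct page *sketch_alloc_slab_page(gfp_t flags, unsigned int order)
{
	struct page *page;
	int nid;

	/* Pass 1: stay on each node, don't reclaim (no __GFP_WAIT). */
	for_each_online_node(nid) {
		page = alloc_pages_node(nid,
				(flags | __GFP_THISNODE) & ~__GFP_WAIT,
				order);
		if (page)
			return page;
	}

	/* Pass 2: let alloc_pages() do its normal job. */
	return alloc_pages(flags, order);
}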

> 
> > To sum it up. Basically, there are two ways of handling kmemcg charges:
> >
> >  1. Make the memcg try_charge mimic alloc_pages behavior.
> >  2. Make API functions (kmalloc, etc) work in memcg as if they were
> >     called from the root cgroup, while keeping interactions between the
> >     low level subsys (slab) and memcg private.
> >
> > Way 1 might look appealing at first glance, but at the same time it
> > is much more complex, because alloc_pages has grown over the years to
> > handle a lot of subtle situations that may arise under global memory
> > pressure but are impossible in memcg. What does way 1 give us then? We
> > can't insert try_charge directly into alloc_pages and have to spread its
> > calls all over the code anyway, so why is it better? Easier to use it in
> > places where users depend on buddy allocator peculiarities? There are
> > not many such users.
> 
> Would it be possible to have a special alloc_pages_memcg with different
> semantics?
> 
> On the other hand alloc_pages() has grown to handle all the special cases.
> Why can't it also handle the special memcg case? There are numerous other

Because we don't want to place memcg handling in alloc_pages(). AFAIU
this is because memcg by design works at a higher layer than the buddy
allocator. We can't just charge a page on alloc and uncharge it on free.
Sometimes we need to charge a page to a memcg different from the current
one, and sometimes we need to move a page charge between cgroups,
adjusting the LRU in the meantime (e.g. when handling readahead or
swapin). Placing memcg charging in alloc_pages() would IMO only obscure
the memcg logic, because handling of the same page would be spread over
subsystems at different layers. I may be completely wrong though.
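
To illustrate what I mean, here is a rough sketch of how the existing
two-step charge API (mem_cgroup_try_charge/commit_charge/cancel_charge)
gets used by a caller; insert_page_somewhere() is a made-up stand-in for
whatever the caller actually does with the page, e.g. page cache
insertion:

#include <linux/memcontrol.h>
#include <linux/mm.h>

/* made-up stand-in for e.g. adding the page to the page cache */
int insert_page_somewhere(struct page *page);

/*
 * Rough sketch only: the caller, not the page allocator, decides which
 * memcg to charge and commits or cancels the charge depending on what
 * happens to the page afterwards.
 */
static int sketch_charge_new_page(struct page *page, struct mm_struct *mm,
				  gfp_t gfp)
{
	struct mem_cgroup *memcg;
	int err;

	err = mem_cgroup_try_charge(page, mm, gfp, &memcg);
	if (err)
		return err;

	err = insert_page_somewhere(page);	/* hypothetical step */
	if (err) {
		mem_cgroup_cancel_charge(page, memcg);
		return err;
	}

	/* lrucare == false: the page is not on an LRU list yet */
	mem_cgroup_commit_charge(page, memcg, false);
	return 0;
}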

> allocators that cache memory in the kernel, from networking to
> the bizarre compressed swap approaches. How does memcg handle that? Isn't

Frontswap/zswap entries are accounted to the memsw counter, like
conventional swap. I don't think we need to charge them to mem, because
zswap size is limited. The user allows some RAM to be used as swap
transparently to running processes, so charging them to mem would be
unexpected IMO.

Skbs are charged to a different counter, but not to kmem for now. That
is yet to be fixed.

> that situation similar to what the slab allocators do?

I wouldn't say so. Other users just use kmalloc or alloc_pages to grow
their buffers. kmalloc is accounted. For those that work at page
granularity and hence call alloc_pages directly, there is the
alloc_kmem_pages helper.
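
Roughly like this, assuming the current alloc_kmem_pages()/
free_kmem_pages() interface; grow_buffer/shrink_buffer are made-up
names, just to show the intended usage:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Made-up caller: accounted, page-granularity allocation without going
 * through kmalloc. The pages get charged to the current memcg just like
 * slab pages do.
 */
static void *grow_buffer(unsigned int order)
{
	struct page *page = alloc_kmem_pages(GFP_KERNEL, order);

	return page ? page_address(page) : NULL;
}

static void shrink_buffer(void *buf, unsigned int order)
{
	free_kmem_pages((unsigned long)buf, order);
}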

> 
> > exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> > even handle kmem_cache destruction on memcg offline differently for SLAB
> > and SLUB for performance reasons.
> 
> Ugly. Internal allocator design impacts container handling.

The point is that memcg charges pages, while kmalloc works at a finer
level of granularity. As a result, we have two orthogonal strategies for
charging kmalloc:

 1. Teach memcg to charge arbitrarily sized chunks and store info about
    the owning memcg near each active object in the slab.
 2. Create a per-memcg copy of each kmem cache (this is the scheme
    currently in use; see the sketch below).

Whichever way we choose, memcg and slab have to cooperate, and so the
slab allocator's internal design impacts memcg handling.
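
For reference, on the allocation path scheme 2 boils down to something
like the sketch below. It is very much simplified: sketch_slab_alloc()
and actual_object_alloc() are made-up names, and the real hooks are
memcg_kmem_get_cache() plus memcg_charge_slab() when a new slab page is
allocated for the per-memcg cache:

#include <linux/memcontrol.h>
#include <linux/slab.h>

/* made-up stand-in for the regular SLAB/SLUB allocation path */
void *actual_object_alloc(struct kmem_cache *cachep, gfp_t flags);

/*
 * Sketch of scheme 2: redirect the allocation to the current memcg's
 * private copy of the cache, so every slab page backing that copy is
 * charged to the right memcg.
 */
static void *sketch_slab_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	/* Replace the root cache with the per-memcg clone, if any. */
	cachep = memcg_kmem_get_cache(cachep, flags);

	return actual_object_alloc(cachep, flags);
}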

> 
> > Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> > for optimization, but their API is well defined, so we just make kmalloc
> > work as expected while providing inter-subsys calls, like
> > memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> > mentioned kmem users that allocate memory using alloc_pages. There is an
> > API function for them too, alloc_kmem_pages. Everything behind the API
> > is hidden and may be done in such a way as to achieve optimal performance.
> 
> Can we also hide cgroup memory handling behind the page-based schemes
> without having extra handling for the slab allocators?
> 

I doubt it - see above.

Thanks,
Vladimir
