linux-kernel - Re: Will the recent memory leak fixes be backported to longterm kernels?


Message-ID: <20181102162237.GB17619@tower.DHCP.thefacebook.com>
Date:   Fri, 2 Nov 2018 16:22:41 +0000
From:   Roman Gushchin <guro@...com>
To:     Michal Hocko <mhocko@...nel.org>
CC:     Dexuan Cui <decui@...rosoft.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "Shakeel Butt" <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Tejun Heo <tj@...nel.org>, Rik van Riel <riel@...riel.com>,
        Konstantin Khlebnikov <koct9i@...il.com>,
        Matthew Wilcox <willy@...radead.org>,
        "Stable@...r.kernel.org" <Stable@...r.kernel.org>
Subject: Re: Will the recent memory leak fixes be backported to longterm
 kernels?

On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
> On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
> > On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> > > [...]
> > > > I totally agree. I'm now just wondering if there is any temporary workaround,
> > > > even if that means we have to run the kernel with some features disabled or
> > > > with a suboptimal performance?
> > > 
> > > One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> > > option). That would reduce the memory isolation because quite a lot of
> > > memory will not be accounted for but the primary source of in-flight and
> > > hard to reclaim memory will be gone.
> > 
> > In my experience disabling the kmem accounting doesn't really solve the issue
> > (without patches), but can lower the rate of the leak.
> 
> This is unexpected. 90cbc2508827e was introduced to address offline
> memcgs to be reclaim even when they are small. But maybe you mean that
> we still leak in an absence of the memory pressure. Or what does prevent
> memcg from going down?

Three independent issues contribute to this leak:
1) Kernel stack accounting weirdness: processes can reuse stacks accounted to
different cgroups. So basically any running process can take a reference to any
cgroup.
2) We forget to scan the last page in the LRU list. So if we end up with a
1-page-long LRU, that page can stay there basically forever.
3) We don't apply enough pressure on slab objects.

Because one reference is enough to keep the entire memcg structure in place,
we really have to close all three to eliminate the leak. Disabling kmem
accounting mitigates only the last one.
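
A rough way to observe the effect described above is to compare the number of
memory cgroups the kernel still holds with the number visible in the cgroup
filesystem. The sketch below is not from the original message. It assumes a
cgroup v1 memory hierarchy (commonly mounted at /sys/fs/cgroup/memory) and the
usual /proc/cgroups column layout; the function names are hypothetical.

```python
import os


def memory_cgroup_count(proc_cgroups_text):
    """Return num_cgroups for the memory controller from /proc/cgroups text.

    Assumed columns: subsys_name, hierarchy, num_cgroups, enabled.
    Returns None if there is no "memory" line.
    """
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#"):
            continue  # header line
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "memory":
            return int(fields[2])
    return None


def visible_cgroup_count(root):
    """Count cgroup directories under root, including root itself."""
    return sum(1 for _ in os.walk(root))


def estimate_dying_memcgs(proc_cgroups_text, root):
    """Estimate how many removed memcgs are still pinned by references.

    This is only an estimate: cgroups can be created or removed between
    reading /proc/cgroups and walking the directory tree.
    """
    total = memory_cgroup_count(proc_cgroups_text)
    if total is None:
        return None
    return max(total - visible_cgroup_count(root), 0)
```

On a live system one might call
estimate_dying_memcgs(open("/proc/cgroups").read(), "/sys/fs/cgroup/memory");
a number that keeps growing over time would be consistent with the leak.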

> 
> > > Another workaround could be to use force_empty knob we have in v1 and
> > > use it when removing a cgroup. We do not have it in cgroup v2 though.
> > > The file hasn't been added to v2 because we didn't really have any
> > > proper usecase. Working around a bug doesn't sound like a _proper_
> > > usecase but I can imagine workloads that bring a lot of metadata objects
> > > that are not really interesting for later use so something like a
> > > targeted drop_caches...
> > 
> > This can help a bit too, but even using the system-wide drop_caches knob
> > unfortunately doesn't return all the memory back.
> 
> Could you be more specific please?

Sure. Because problems 1) and 2) exist, echo 3 > /proc/sys/vm/drop_caches can't
reclaim all memcg structures in most cases.
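
For anyone who wants to repeat that experiment from a script, the following is
a minimal sketch of the same drop_caches write. It is not part of the original
message. The function name and the path parameter are hypothetical (the
parameter exists only so the helper can be exercised against an ordinary file),
and writing to the real knob requires root.

```python
import os


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop clean caches.

    mode 1 drops the page cache, 2 drops reclaimable slab objects
    (dentries and inodes), 3 drops both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Dirty objects are not freeable, so flush them to disk first.
    os.sync()
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

As the message above says, even after this write some memcg structures are
expected to stay pinned while problems 1) and 2) are unfixed.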

Thanks!