Date:   Tue, 26 Feb 2019 22:08:02 +0000
From:   Roman Gushchin <guro@...com>
To:     Andrey Ryabinin <aryabinin@...tuozzo.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        "Michal Hocko" <mhocko@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Rik van Riel <riel@...riel.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Shakeel Butt <shakeelb@...gle.com>
Subject: Re: [PATCH RFC] mm/vmscan: try to protect active working set of
 cgroup from reclaim.

On Tue, Feb 26, 2019 at 06:36:38PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 2/25/19 7:03 AM, Roman Gushchin wrote:
> > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >> In the presence of more than one memory cgroup in the system, our reclaim
> >> logic just sucks. When we hit a memory limit (global or a limit on a
> >> cgroup with subgroups) we reclaim some memory from all cgroups.
> >> This sucks because the cgroup that allocates more often always wins.
> >> E.g. a job that allocates a lot of clean, rarely used page cache will push
> >> other jobs, with relatively small but actively used in-memory working
> >> sets, out of memory.
> >>
> >> To prevent such situations we have memcg controls like low/max, etc., which
> >> are supposed to protect jobs or limit them so they do not hurt others.
> >> But memory cgroups are very hard to configure right, because doing so requires
> >> precise knowledge of the workload, which may vary during execution.
> >> E.g. setting a memory limit means the job won't be able to use all memory
> >> in the system for page cache even if the rest of the system is idle.
> >> Basically, our current scheme requires configuring every single cgroup
> >> in the system.
> >>
> >> I think we can do better. The idea proposed by this patch is to reclaim
> >> only inactive pages, and only from cgroups that have a big
> >> (!inactive_is_low()) inactive list. We go back to shrinking active lists
> >> only if all inactive lists are low.
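Roughly, the proposed selection policy can be sketched in userspace C like this (the struct and function names here are illustrative, not the kernel's actual structures, and the "inactive is low" check is simplified — the real inactive_list_is_low() also scales the target ratio with LRU size):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one memcg's LRU sizes. */
struct cg_lru {
	unsigned long active;
	unsigned long inactive;
};

/* Simplified analogue of inactive_list_is_low(): the inactive list is
 * "low" when it is smaller than the active list. */
static bool inactive_is_low(const struct cg_lru *cg)
{
	return cg->inactive < cg->active;
}

/* The proposed policy: reclaim only from cgroups whose inactive list is
 * not low, preferring the biggest one; a NULL return means every
 * inactive list is low, so fall back to shrinking active lists. */
static const struct cg_lru *pick_reclaim_target(const struct cg_lru *cgs,
						size_t n)
{
	const struct cg_lru *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (inactive_is_low(&cgs[i]))
			continue;
		if (!best || cgs[i].inactive > best->inactive)
			best = &cgs[i];
	}
	return best;
}
```

So a cgroup full of cold page cache gets picked first, and active lists are only touched once no cgroup has a big inactive list left.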
> > 
> > Hi Andrey!
> > 
> > It's definitely an interesting idea! However, let me bring some concerns:
> > 1) What's considered active and inactive depends on memory pressure inside
> > a cgroup.
> 
> There is no such dependency. High memory pressure may be generated both
> by active and inactive pages. We also can have a cgroup creating no pressure
> with almost only active (or only inactive) pages.
> 
> > Actually active pages in one cgroup (e.g. just deleted) can be colder
> > than inactive pages in an other (e.g. a memory-hungry cgroup with a tight
> > memory.max).
> > 
> 
> Well, yes, this is a drawback of having per-memcg lrus.
> 
> > Also a workload inside a cgroup can to some extent control what's going
> > to the active LRU. So it opens a way to get more memory unfairly by
> > artificially promoting more pages to the active LRU. So a cgroup
> > can get an unfair advantage over other cgroups.
> > 
> 
> Unfair is usually a negative term, but in this case it very much depends
> on the definition of what is "fair".
> 
> If fair means putting equal reclaim pressure on all cgroups, then yes, the patch
> increases such unfairness, but that unfairness is a good thing.
> Obviously it's more valuable to keep an actively used page in memory than a page that is not used.

I think that fairness is good here.

> 
> > Generally speaking, now we have a way to measure the memory pressure
> > inside a cgroup. So, in theory, it should be possible to balance
> > scanning effort based on memory pressure.
> > 
> 
> Simply by design, the inactive pages are the first candidates for reclaim.
> Any decision that doesn't take inactive pages into account would probably be wrong.
> 
> E.g. cgroup A runs an active job loading a big, hot working set, which creates
> high memory pressure, while cgroup B is idle (no memory pressure) with a huge unused cache.
> It's definitely preferable to reclaim from B rather than from A.
>

For sure, reclaiming hot pages instead of cold ones is bad for overall
performance. But the active and inactive LRUs are just an approximation of
what is hot and cold. E.g. if I run "cat some_large_file" twice in a cgroup,
the whole file will reside in the active LRU and be considered hot, even if
nobody ever uses it again.
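The two-touch promotion behind that example can be sketched as a toy state machine (this is a simplification of the real mark_page_accessed() path, not the kernel code itself):

```c
/* Toy model of LRU placement: the first access faults a page onto the
 * inactive list; a second access (the second `cat`) promotes it to the
 * active list, where it now looks "hot" regardless of future use. */
enum lru_list { LRU_NONE, LRU_INACTIVE, LRU_ACTIVE };

static enum lru_list touch_page(enum lru_list cur)
{
	switch (cur) {
	case LRU_NONE:
		return LRU_INACTIVE;	/* first fault: inactive list */
	case LRU_INACTIVE:
		return LRU_ACTIVE;	/* second touch: promoted */
	default:
		return LRU_ACTIVE;	/* already active: stays there */
	}
}
```

After the second pass over the file, every page of it sits on the active list even though it may never be referenced again.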

So it means that depending on memory usage pattern, some workloads will benefit
from your change, and some will suffer.

Btw, what will happen with protected cgroups (those with memory.low set)?
Will they still affect global scanning decisions (the active/inactive ratio),
but be exempted from scanning themselves?
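To make that concern concrete, here is a toy model (the `protected_low` flag and the helper are hypothetical, for illustration only): if memory.low-protected cgroups keep large inactive lists yet are skipped by reclaim, the remaining cgroups have far less reclaimable inactive memory than the global counters suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup state for this illustration. */
struct cg {
	unsigned long inactive;
	bool protected_low;	/* under effective memory.low protection */
};

/* Inactive pages actually available to the reclaimer once protected
 * cgroups are exempted from scanning. */
static unsigned long reclaimable_inactive(const struct cg *cgs, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!cgs[i].protected_low)
			total += cgs[i].inactive;
	return total;
}
```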

Thanks!
