linux-kernel - Re: [PATCH v6 6/9] mm: multigenerational lru: aging


Message-ID: <Yd109jeRllJbjV9o@dhcp22.suse.cz>
Date:   Tue, 11 Jan 2022 13:15:50 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Alexey Avramov <hakavlad@...ox.lv>
Cc:     Yu Zhao <yuzhao@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Michael Larabel <Michael@...haellarabel.com>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        page-reclaim@...gle.com, x86@...nel.org,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why a userspace-based thrashing
> > detection cannot be used for those.
> 
> Firstly,
> because this is the task of the kernel, not the user space.
> Memory is managed by the kernel, not by the user space.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot use process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
> and depend on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. I have read through them, and my understanding
is that most of those problems are specific to the tool in question
rather than anything fundamental caused by a lack of kernel support.

> One of the biggest challenges for userspace oom-killers is that they
> have to function under intense memory pressure and are prone to getting
> stuck in memory reclaim themselves [4].

This one is more interesting, and the truth is that handling a complete
OOM situation from userspace is really tricky, especially with a more
complex oom decision policy. In the past we have discussed potential
ways to implement an oom kill policy via kernel modules or eBPF, but
nobody has followed up on that.

But I suspect you are mixing up two things here. One of them is an
out-of-memory situation, where no memory can be reclaimed and allocated.

The other is one where memory can be reclaimed and progress is made,
but that leads to thrashing, where most of the time is spent on
refaulting memory that was reclaimed shortly before.

The first one is addressed by the global oom killer, which tries to be
as conservative as possible because an OOM kill is a very disruptive
operation. The latter one is more complex, and handling it properly
really depends on the particular workload, because it is more of a QoS
matter than an emergency action to keep the system alive.

There are workloads which prefer temporary thrashing over their working
set during a peak in memory demand rather than an OOM kill, because way
too much work would be thrown away. On the other hand, latency-sensitive
workloads can see even direct reclaim as a runtime-visible problem.

I hope you can imagine that there is a really large gap between those
two cases and that no simple solution can be applied to the whole
range. Therefore we have PSI and refault stats exported to userspace
so that a workload-specific policy can be implemented there.
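
To illustrate how userspace might consume that data, here is a minimal
Python sketch. It assumes the usual PSI line format found in
/proc/pressure/memory ("some"/"full" lines with avg10, avg60, avg300 and
total fields) and the workingset_refault* counters in /proc/vmstat,
whose exact names vary between kernel versions. The function names are
hypothetical and this is not taken from any of the tools discussed.

```python
def parse_psi(text):
    """Parse PSI file contents, e.g. the text of /proc/pressure/memory.

    Returns a dict such as {"some": {"avg10": 1.5, ..., "total": 123}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # "some" or "full"
        values = {}
        for item in fields[1:]:
            key, _, value = item.partition("=")
            # "total" is a stall time in microseconds, the rest are percentages
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def parse_refaults(vmstat_text):
    """Extract the workingset_refault* counters from /proc/vmstat contents."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name.startswith("workingset_refault"):
            counters[name] = int(value)
    return counters
```

A monitor would read both files periodically, for example with
open("/proc/pressure/memory").read(), and feed the deltas into whatever
policy suits the workload.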

If userspace has a hard time using that data and acting upon it, then
let's talk about specifics. In most steady thrashing situations I have
seen, a userspace agent with its memory and code mlocked can make
forward progress and mitigate the situation.
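
A rough sketch of that idea in Python, under stated assumptions: the
monitor pins its own memory with mlockall(2) through ctypes so that it
does not get stuck refaulting its own pages, and a hypothetical policy
treats a run of consecutive high "full" avg10 samples as steady
thrashing. The MCL_* values are the common Linux ones (x86, arm64) and
differ on some architectures; the threshold and window are made-up
illustration values, not recommendations.

```python
import ctypes
import ctypes.util
import os

# Assumption: common Linux values; some architectures define them differently.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_all_memory():
    """Pin current and future mappings of this process via mlockall(2).

    Needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK; raises OSError otherwise.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def is_thrashing(full_avg10_samples, threshold=20.0, window=6):
    """Hypothetical policy: True if the last `window` samples all reach
    `threshold` percent, i.e. the pressure is steady rather than a spike."""
    if len(full_avg10_samples) < window:
        return False
    return all(v >= threshold for v in full_avg10_samples[-window:])
```

The agent would call lock_all_memory() once at startup, then sample PSI
at a fixed interval and act (throttle, freeze, or kill, depending on the
workload) only once is_thrashing() reports sustained pressure.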

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/
-- 
Michal Hocko
SUSE Labs