linux-kernel - Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the working set


Message-ID: <YbcNUEZ08lmbv0RM@dhcp22.suse.cz>
Date:   Mon, 13 Dec 2021 10:07:28 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Alexey Avramov <hakavlad@...ox.lv>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        ValdikSS <iam@...dikss.org.ru>, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, corbet@....net, mcgrof@...nel.org,
        keescook@...omium.org, yzaikin@...gle.com,
        oleksandr@...alenko.name, kernel@...mod.org, aros@....com,
        hakavlad@...il.com, Yu Zhao <yuzhao@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>, hdanton@...a.com,
        riel@...riel.com, Shakeel Butt <shakeelb@...gle.com>
Subject: Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the working
 set

On Mon 13-12-21 05:15:21, Alexey Avramov wrote:
> So, the problem described by Artem S. Tashkinov in 2019 is still easily
> reproduced in 2021. The assurances of the maintainers that they consider
> the thrashing and near-OOM stalls to be a serious problem are difficult to
> take seriously while they ignore the obvious solution: if reclaiming file
> caches leads to thrashing, then you just need to prohibit deleting the file
> cache. And allow the user to control its minimum amount.

These are rather strong claims. While this might sound like a very easy
solution/workaround, I have already tried to express my concerns [1].

Really, you should realize that such a knob would become carved
into stone as soon as we merge this, and we would need to support it
forever! It is really painful (if possible at all) to deprecate any
tunable knobs that cannot be supported anymore because the underlying
implementation doesn't allow for that.  So we would absolutely need to
be sure this is the right approach to the problem.  I am not convinced
about that though.

How does the admin know what the limit should be set to for a certain
workload? What if the workload characteristics change and the existing
setting is just too restrictive? What if the workload is thrashing over
something different than anon/file memory (e.g. any other cache that we
have or might have in the future)?

As you have pointed out, there were general recommendations to use
userspace based oom killer solutions which can be tuned for the specific
workload or used in an environment where the disruptive OOM killer
action is less of a problem because the workload can be restarted easily
without too much harm caused by the oom killer.
Please keep in mind that there are many more workloads that have
different requirements, and an oom killer invocation can be much worse
than slow progress due to ephemeral, peak or even longer term thrashing
or heavy refaults.

The kernel OOM killer acts as the last resort solution and therefore
stays really conservative. I do believe that integrating PSI metrics
into that decision is the right direction. It is not a trivial one
though.

Why is this a better approach than a simple limit? Well, for one, it is
a feedback based solution. The system knows it is thrashing and can
estimate how hard. It is not about a specific type of memory because we
can detect refaults on both file and anonymous memory (it can be
extended should there be a need for future types of caches or
reclaimable memory). Memory reclaim can work with that information and
balance different resources dynamically based on the available feedback.
MM code will not need to expose implementation details about how the
reclaim works and so we do not bind ourselves to long-term specifics.
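
For illustration only, here is a minimal userspace sketch of what
"feedback based" can look like with PSI. It is not the in-kernel
integration discussed above. It assumes a kernel built with CONFIG_PSI,
which exposes /proc/pressure/memory. The helper names parse_psi and
is_thrashing and the 10% threshold are hypothetical and chosen only for
this example.

```python
from pathlib import Path

# Exposed by kernels built with CONFIG_PSI; may be absent elsewhere.
PSI_MEMORY = Path("/proc/pressure/memory")


def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_thrashing(psi, full_avg10_limit=10.0):
    """Return True when 'full' memory pressure over the last 10 seconds
    reaches the limit. The limit is a hypothetical value, not a
    recommendation; a real policy would be tuned per workload."""
    return psi.get("full", {}).get("avg10", 0.0) >= full_avg10_limit


if __name__ == "__main__":
    if PSI_MEMORY.exists():
        psi = parse_psi(PSI_MEMORY.read_text())
        print(psi, "thrashing" if is_thrashing(psi) else "ok")
    else:
        print("PSI is not available on this system")
```

The point of the sketch is that the decision is driven by a measured
stall signal rather than by a fixed amount of a particular memory type,
so it keeps working when the workload or the kind of cache under
pressure changes.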

See the difference?

If you can live with a premature and over-eager OOM killer policy then
all is fine. Use existing userspace solutions. If you want to work on an
in-kernel solution, please try to understand the complexity and
historical experience with similar solutions first. It also helps to
understand that there are no simple solutions on the table. MM reclaim
code has evolved over many years. I strongly suspect we have run out of
simple solutions already. We also got burnt many times. Let's not repeat
those errors again.

[1] http://lkml.kernel.org/r/Ya3fG2rp+860Yb+t@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs