Date:   Tue, 22 Sep 2020 18:55:27 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Johannes Weiner <hannes@...xchg.org>, Roman Gushchin <guro@...com>,
        Greg Thelen <gthelen@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Michal Koutný <mkoutny@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        Cgroups <cgroups@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Yang Shi <shy828301@...il.com>
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface

On Tue 22-09-20 08:54:25, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Mon 21-09-20 10:50:14, Shakeel Butt wrote:
[...]
> > > Let me add one more point. Even if the high limit reclaim is swift, it
> > > can still take 100s of usecs. Most of our jobs are anon-only and we
> > > use zswap. Compressing a page can take a couple usecs, so 100s of usecs
> > > in limit reclaim is normal. For latency sensitive jobs, this amount of
> > > hiccup does matter.
> >
> > Understood. But isn't this an implementation detail of zswap? Can it
> > offload some of the heavy lifting to a different context and reduce the
> > general overhead?
> >
> 
> Are you saying doing the compression asynchronously? Similar to how
> the disk-based swap triggers the writeback and puts the page back to
> LRU, so the next time reclaim sees it, it will be instantly reclaimed?
> Or send the batch of pages to be compressed to a different CPU and
> wait for the completion?

Yes.

[...]

> > You are right that misconfigured limits can result in problems. But such
> > a configuration should be quite easy to spot, which is not the case for
> > targeted reclaim calls which do not leave any footprints behind.
> > Existing interfaces also try not to expose internal implementation
> > details. You are proposing a very targeted interface to
> > fine-control the memory reclaim. There is a risk that userspace will
> > start depending on a specific reclaim implementation/behavior and future
> > changes would be prone to regressions in workloads relying on that. So
> > effectively, any user space memory reclaimer would need to be tuned to a
> > specific implementation of the memory reclaim.
> 
> I don't see how this exposes the internal memory reclaim implementation.
> The interface is very simple: reclaim a given amount of memory. Either
> the kernel will reclaim less memory or it will over-reclaim. In case
> of reclaiming less memory, the user space can retry, given there is
> enough reclaimable memory. For the over-reclaim case, the user space
> will back off for a longer time. How are the internal reclaim
> implementation details exposed?
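
For reference, a minimal sketch of the retry/backoff usage pattern described
above. The memory.reclaim file name and its write format are assumptions
about the proposed interface rather than a confirmed ABI; memory.current is
the existing cgroup v2 usage counter, and the cgroup path is just an example.

/*
 * Ask the kernel to reclaim `want` bytes from one memcg, then compare
 * usage before and after to decide whether to retry or back off.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long read_bytes(const char *path)
{
	char buf[32];
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return atol(buf);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/workload";	/* example memcg   */
	unsigned long want = 64UL << 20;		/* ask for 64M     */
	char path[256], buf[32];
	long before, after, freed;
	int fd;

	snprintf(path, sizeof(path), "%s/memory.current", cg);
	before = read_bytes(path);

	/* Assumed interface: write a byte count to memory.reclaim. */
	snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return 1;
	snprintf(buf, sizeof(buf), "%lu", want);
	if (write(fd, buf, strlen(buf)) < 0)
		perror("reclaim write");
	close(fd);

	snprintf(path, sizeof(path), "%s/memory.current", cg);
	after = read_bytes(path);
	freed = before - after;

	if (freed < (long)want)
		fprintf(stderr, "reclaimed less: retry if reclaimable memory remains\n");
	else if (freed > (long)want)
		fprintf(stderr, "over-reclaimed: back off for a longer period\n");
	return 0;
}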

In an ideal world, yes. A feedback mechanism would be independent of the
particular implementation. But the reality tends to disagree quite
often. Once we provide a tool there will be users using it to the best
of their knowledge. Very often as a hammer. This is what the history of
kernel regressions and "we have to revert an obvious fix because
userspace depends on an undocumented behavior which happened to work for
some time" has taught us the hard way.

I really do not want to deal with reports where a new heuristic in the
memory reclaim breaks something just because the reclaim takes slightly
longer or over/under-reclaims differently, so that existing assumptions
no longer hold and the overall balancing from userspace falls apart.

This might be a shiny exception of course. And please note that I am not
saying that the interface is completely wrong or unacceptable. I just
want to be absolutely sure we cannot move forward with the existing API
space that we have.

So far I have learned that you are primarily working around an
implementation detail in zswap, which does the swapout directly in the
pageout path. That sounds like a very bad reason to add a new interface.
You are right that there are likely other use cases for this new
interface - mostly to emulate drop_caches - but I believe those are
quite misguided as well and we should work harder to help them use the
existing APIs. Last but not least, the memcg background reclaim is
something that should be possible without a new interface.
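
A minimal sketch of what such background reclaim could look like on top of
the existing cgroup v2 API only: writing memory.high to a value below the
current usage makes the kernel try to reclaim the group down toward the new
value during the write, after which the limit can be lifted again. The
cgroup path and the 256M target here are purely illustrative, not something
taken from the patch.

/*
 * Proactive reclaim via the existing memory.high knob: lower it so the
 * write itself triggers reclaim toward the target, then restore "max".
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	const char *high = "/sys/fs/cgroup/workload/memory.high";
	char buf[32];
	/* Target below current usage; a real reclaimer would compute this
	 * from memory.current instead of hardcoding it. */
	unsigned long target = 256UL << 20;

	snprintf(buf, sizeof(buf), "%lu", target);
	if (write_str(high, buf))	/* kernel reclaims toward target */
		return 1;
	return write_str(high, "max");	/* lift the limit again */
}

The obvious design trade-off is that the group stays capped at the lowered
memory.high until it is restored, which is part of why the thread debates
whether the existing knobs are sufficient.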
-- 
Michal Hocko
SUSE Labs
