linux-kernel - Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering memory.high


Message-ID: <20200714155017.GQ24642@dhcp22.suse.cz>
Date:   Tue, 14 Jul 2020 17:50:17 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Roman Gushchin <guro@...com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Linux MM <linux-mm@...ck.org>,
        Kernel Team <kernel-team@...com>,
        LKML <linux-kernel@...r.kernel.org>,
        Domas Mituzas <domas@...com>, Tejun Heo <tj@...nel.org>,
        Chris Down <chris@...isdown.name>
Subject: Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering
 memory.high

On Tue 14-07-20 08:32:09, Shakeel Butt wrote:
> On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko <mhocko@...nel.org> wrote:
> >
> > On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@...com> wrote:
> > > >
> > > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@...nel.org> wrote:
> > > > > >
> > > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > > Memory.high limit is implemented in a way such that the kernel
> > > > > > > penalizes all threads which are allocating a memory over the limit.
> > > > > > > Forcing all threads into the synchronous reclaim and adding some
> > > > > > > artificial delays allows to slow down the memory consumption and
> > > > > > > potentially give some time for userspace oom handlers/resource control
> > > > > > > agents to react.
> > > > > > >
> > > > > > > It works nicely if the memory usage is hitting the limit from below,
> > > > > > > however it works sub-optimal if a user adjusts memory.high to a value
> > > > > > > way below the current memory usage. It basically forces all workload
> > > > > > > threads (doing any memory allocations) into the synchronous reclaim
> > > > > > > and sleep. This makes the workload completely unresponsive for
> > > > > > > a long period of time and can also lead to a system-wide contention on
> > > > > > > lru locks. It can happen even if the workload is not actually tight on
> > > > > > > memory and has, for example, a ton of cold pagecache.
> > > > > > >
> > > > > > > In the current implementation writing to memory.high causes an atomic
> > > > > > > update of page counter's high value followed by an attempt to reclaim
> > > > > > > enough memory to fit into the new limit. To fix the problem described
> > > > > > > above, all we need is to change the order of execution: try to push
> > > > > > > the memory usage under the limit first, and only then set the new
> > > > > > > high limit.
> > > > > >
> > > > > > Shakeel would this help with your pro-active reclaim usecase? It would
> > > > > > require to reset the high limit right after the reclaim returns which is
> > > > > > quite ugly but it would at least not require a completely new interface.
> > > > > > You would simply do
> > > > > >         high = current - to_reclaim
> > > > > >         echo $high > memory.high
> > > > > >         echo infinity > memory.high # To prevent direct reclaim
> > > > > >                                     # allocation stalls
> > > > > >
> > > > >
> > > > > This will reduce the chance of stalls but the interface is still
> > > > > non-delegatable i.e. applications can not change their own memory.high
> > > > > for the use-cases like application controlled proactive reclaim and
> > > > > uswapd.
> > > >
> > > > Can you, please, elaborate a bit more on this? I didn't understand
> > > > why.
> > > >
> > >
> > > Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> > > don't think so otherwise any job on a system can change their
> > > memory.high and can adversely impact the isolation and memory
> > > scheduling of the system.
> >
> > Is this really the case? There should always be a parent cgroup that
> > overrides the setting.
> 
> Can you explain a bit more? I don't see any requirement of having a
> layer of cgroup between root and the job cgroup. Internally we
> schedule jobs as top level cgroups. There do exist jobs which are a
> combination of other jobs and there we do use an additional layer of
> cgroup (similar to pods running multiple containers in kubernetes).
> Surely we can add a layer for all the jobs but it comes with an
> overhead and at scale that overhead is not negligible.

What I had in mind is that if you want to delegate, you have the
option to add a layer where you predefine restrictions/guarantees so
that the delegated cgroup under that hierarchy cannot run away. So
configuring the high limit in a delegated cgroup should be reasonably safe.
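
A rough sketch of such a layer, assuming a cgroup v2 hierarchy mounted at
/sys/fs/cgroup with the memory controller already enabled at the root. The
CGROOT variable, the function name and the cgroup names are hypothetical and
only serve the illustration:

```shell
#!/bin/sh
# Sketch only: CGROOT, the function name and all cgroup names are
# hypothetical. Assumes cgroup v2 with the memory controller enabled
# in the root's cgroup.subtree_control.
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# add_delegation_layer <layer> <child> <max_bytes> <high_bytes>
# Creates <layer> with limits owned by the administrator and an empty
# <child> cgroup below it that can later be handed to the job.
add_delegation_layer() {
    layer="$CGROOT/$1"
    mkdir -p "$layer" || return 1
    # Restrictions the delegated subtree cannot exceed.
    echo "$3" > "$layer/memory.max" || return 1
    echo "$4" > "$layer/memory.high" || return 1
    # Make the memory controller available to children of the layer.
    echo "+memory" > "$layer/cgroup.subtree_control" || return 1
    # The cgroup that would be delegated.
    mkdir -p "$layer/$2"
}
```

Something like `add_delegation_layer jobs job1 2147483648 1610612736` would
then leave jobs/job1 free to tune its own memory.high while the limits on
jobs stay in force.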

> > Also you can always set the hard limit if you do
> > not want to add another layer of cgroup in the hierarchy before
> > delegation. Or am I missing something?
> >
> 
> Yes, we can set memory.max though it has different oom semantics and
> not really a replacement for memory.high.

Right, but you can define a safe cap this way and leave the high
watermark for the delegated cgroup.
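
Without the extra layer, this could look roughly as follows. It is a minimal
sketch, assuming cgroup v2 at /sys/fs/cgroup and plain byte values; CGROOT
and both function names are hypothetical, and the check that high stays at
or below the cap is the sketch's own policy, not something the kernel
enforces:

```shell
#!/bin/sh
# Sketch only: CGROOT and the function names are hypothetical.
# Limits are expected as plain byte counts (not "max").
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# set_safe_cap <cgroup> <max_bytes>
# Done once by the administrator before handing the cgroup over.
set_safe_cap() {
    echo "$2" > "$CGROOT/$1/memory.max"
}

# adjust_high <cgroup> <high_bytes>
# Done by whoever is allowed to tune the high watermark.
adjust_high() {
    cap=$(cat "$CGROOT/$1/memory.max") || return 1
    # Sanity check of this sketch: a high above the cap is pointless.
    if [ "$2" -gt "$cap" ]; then
        echo "high ($2) is above the cap ($cap)" >&2
        return 1
    fi
    echo "$2" > "$CGROOT/$1/memory.high"
}
```

The cap bounds the damage a misconfigured memory.high can do, while the
throttling behaviour of memory.high itself stays under the job's control.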
-- 
Michal Hocko
SUSE Labs