linux-kernel - Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

Message-ID: <20171221172930.GF1084507@devbig577.frc2.facebook.com>
Date:   Thu, 21 Dec 2017 09:29:30 -0800
From:   Tejun Heo <tj@...nel.org>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Michal Hocko <mhocko@...nel.org>, Li Zefan <lizefan@...wei.com>,
        Roman Gushchin <guro@...com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Greg Thelen <gthelen@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Cgroups <cgroups@...r.kernel.org>, linux-doc@...r.kernel.org
Subject: Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero, increasing load
on the filesystems, which then of course will affect everyone
under memory pressure whether memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >    whether OOM kicks in or not is not even that relevant in
> >    determining the health of the workload.  There are frequent failure
> >    modes where OOM killer fails to kick in while the workload isn't
> >    making any meaningful forward progress.
> >
> 
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing levels of memory pressure, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >    performance of the file backing devices.  Why is that necessarily
> >    better than being dependent on swap or both, which would increase
> >    the reclaim efficiency anyway?  You can't avoid being affected by
> >    the underlying hardware one way or the other.
> 
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memories and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumption because,
in your setup, while swap is always on local disk, the file storage
is over the network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memories like
  that is the direction we want to head.  In fact, the more uniform we
  can behave across them, the more efficient we'd be as we wouldn't
  have that artificial barrier.  It is true that we don't have the
  same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have anon limit, the right thing
  to do would be pushing for an independent anon limit, not memsw.
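
To make the accounting difference concrete, here is a rough sketch
(not kernel code) of the upper bound on a job's anonymous footprint,
resident plus swapped out, under a combined memory+swap limit versus
separate memory and swap limits.  The function names, the 50 MiB swap
allowance and the "swap_free" parameter are assumptions made for
illustration; only the 100 MiB and 150 MiB figures come from the
example quoted above.

```python
MIB = 1024 * 1024


def max_anon_memsw(memsw_limit, file_resident=0):
    """Bound on anon (resident + swapped) under one combined limit.

    A combined limit charges resident pages (anon and file) plus
    swapped-out pages against a single counter, so whatever file
    cache is resident eats into the room left for anon.
    """
    return max(0, memsw_limit - file_resident)


def max_anon_split(memory_max, swap_max, swap_free):
    """Bound on anon under separate memory and swap limits.

    Resident anon is capped by memory_max; swapped-out anon is capped
    by swap_max and by how much swap the machine actually has free
    (swap_free is a hypothetical stand-in for that).
    """
    return memory_max + min(swap_max, swap_free)


# Combined limit: the anon bound does not depend on swap availability.
print(max_anon_memsw(100 * MIB) // MIB)

# Separate limits: the bound moves with the machine's free swap.
print(max_anon_split(100 * MIB, 50 * MIB, swap_free=0) // MIB)
print(max_anon_split(100 * MIB, 50 * MIB, swap_free=50 * MIB) // MIB)

# The conflation: file cache shrinks the anon room under memsw.
print(max_anon_memsw(100 * MIB, file_resident=30 * MIB) // MIB)
```

The last call is the conflation at issue: the combined counter bounds
anon only as a side effect, and the bound shifts with file cache
usage, which an independent anon limit would not do.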

Thanks.

-- 
tejun