Message-ID: <CAAAKZwtQc212_-oqf56ToxjSG7f9bsNcBwwurSezpGKiPDT+nQ@mail.gmail.com>
Date:	Thu, 12 Dec 2013 16:23:18 -0800
From:	Tim Hockin <thockin@...kin.org>
To:	Tejun Heo <tj@...nel.org>
Cc:	David Rientjes <rientjes@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Michal Hocko <mhocko@...e.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Pekka Enberg <penberg@...nel.org>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Li Zefan <lizefan@...wei.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	linux-mm@...ck.org, Cgroups <cgroups@...r.kernel.org>,
	Victor Marmol <vmarmol@...gle.com>
Subject: Re: [patch 7/8] mm, memcg: allow processes handling oom notifications
 to access reserves

On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo <tj@...nel.org> wrote:
> Hello, Tim.
>
> On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote:
>> Yeah sorry.  Replying from my phone is awkward at best.  I know better :)
>
> Heh, sorry about being bitchy. :)
>
>> In my mind, the ONLY point of pulling system-OOM handling into
>> userspace is to make it easier for crazy people (Google) to implement
>> bizarre system-OOM policies.  Example:
>
> I think that's one of the places where we largely disagree.  If at all

Just to be clear - I say this because it doesn't feel right to impose
my craziness on others, and it sucks when we try and are met with
"you're crazy, go away".  And you have to admit that happens to
Google. :)  Punching an escape valve that allows us to be crazy
without hurting anyone else sounds ideal, IF and ONLY IF that escape
valve is itself maintainable.

If the escape valve is userspace it's REALLY easy to iterate on our
craziness.  If it is kernel space, it's somewhat less easy, but not
impossible.

> possible, I'd much prefer google's workload to be supported inside the
> general boundaries of the upstream kernel without having to punch a
> large hole in it.  To me, the general development history of memcg in
> general and this thread in particular seem to epitomize why it is a
> bad idea to have isolated, large and deep "crazy" use cases.  Punching
> the initial hole is the easy part; however, we all are quite limited
> in anticipating future needs and sooner or later that crazy use case is
> bound to evolve further towards the isolated extreme it departed
> towards and require more and larger holes and further contortions to
> accommodate such progress.
>
> The concern I have with the suggested solution is not necessarily that
> it's technically more complex than it looks on the surface - I'm sure
> it can be made to work one way or the other - but that it's a fairly
> large step toward an isolated extreme which memcg as a project
> probably should not head toward.
>
> There sure are cases where such exceptions can't be avoided and are
> good trade-offs but, here, we're talking about a major architectural
> decision which not only affects memcg but mm in general.  I'm afraid
> this doesn't sound like a no-brainer flexibility we can afford.
>
>> When we have a system OOM we want to do a walk of the administrative
>> memcg tree (which is only a couple levels deep, users can make
>> non-admin sub-memcgs), selecting the lowest priority entity at each
>> step (where both tasks and memcgs have a priority and the priority
>> range is much wider than the current OOM scores, and where memcg
>> priority is sometimes a function of memcg usage), until we reach a
>> leaf.
>>
>> Once we reach a leaf, I want to log some info about the memcg doing
>> the allocation, the memcg being terminated, and maybe some other bits
>> about the system (depending on the priority of the selected victim,
>> this may or may not be an "acceptable" situation).  Then I want to
>> kill *everything* under that memcg.  Then I want to "publish" some
>> information through a sane API (e.g. not dmesg scraping).
>>
>> This is basically our policy as we understand it today.  This is
>> notably different than it was a year ago, and it will probably evolve
>> further in the next year.
>
> I think per-memcg score and killing is something which makes
> fundamental sense.  In fact, killing a single process has never made
> much sense to me as that is a unit which ultimately is only meaningful
> to the kernel itself and not necessarily to userland, so no matter
> what I think we're gonna gain per-memcg behavior and it seems most,
> albeit not all, of what you described above should be implementable
> through that.

Well, that's an awesome start.  We have or had patches to do a lot of
this.  I don't know how well scrubbed they are for pushing or whether
they apply at all to current head, though.
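
Roughly, the walk I described above looks like the sketch below.  This
is purely illustrative -- the names and structures (memcg_node,
oom_prio, pick_oom_victim) are invented for this mail, it ignores
per-task priorities and usage-dependent memcg priority, and it is not
lifted from those patches:

        /*
         * Illustrative sketch only: pick the lowest-priority entity at
         * each level of the administrative memcg tree until we hit a
         * leaf.  The caller then logs the allocating memcg and the
         * victim, kills *everything* under the victim, and publishes
         * the event through a sane API (not dmesg scraping).
         */
        struct memcg_node {
                int                      oom_prio;   /* much wider range than OOM scores */
                struct memcg_node       *children;   /* first child */
                struct memcg_node       *sibling;    /* next sibling */
        };

        static struct memcg_node *pick_oom_victim(struct memcg_node *root)
        {
                struct memcg_node *cur = root;

                while (cur && cur->children) {
                        struct memcg_node *child, *lowest = cur->children;

                        for (child = cur->children; child; child = child->sibling)
                                if (child->oom_prio < lowest->oom_prio)
                                        lowest = child;
                        cur = lowest;
                }
                return cur;     /* leaf memcg to kill in its entirety */
        }

In practice this would hang off the memcg hierarchy itself rather than
a private tree, but it captures the shape of the policy.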

> Ultimately, if the use case calls for very fine level of control, I
> think the right thing to do is making nesting work properly which is
> likely to take some time.  In the meantime, even if such use case
> requires modifying the kernel to tailor the OOM behavior, I think
> sticking to kernel OOM provides a lot easier way to eventual
> convergence.  Userland system OOM basically means giving up and would
> lessen the motivation towards improving the shared infrastructures
> while adding significant pressure towards schizophrenic diversion.
>
>> We have a long tail of kernel memory usage.  If we provision machines
>> so that the "do work here" first-level memcg excludes the average
>> kernel usage, we have a huge number of machines that will fail to
>> apply OOM policy because of actual overcommitment.  If we provision
>> for 95th or 99th percentile kernel usage, we're wasting large amounts
>> of memory that could be used to schedule jobs.  This is the
>> fundamental problem we face with static apportionment (and we face it
>> in a dozen other situations, too).  Expressing this set-aside memory
>> as "off-the-top" rather than absolute limits makes the whole system
>> more flexible.
>
> I agree that's pretty sad.  Maybe I shouldn't be surprised given the
> far-from-perfect coverage of kmemcg at this point, but, again,
> *everyone* wants [k]memcg coverage to be more complete and we have built
> and are still building the infrastructures to make that possible, so I'm
> still of the opinion that making [k]memcg work better is the better
> direction to pursue and given the short development history of kmemcg
> I'm fairly sure there are quite a few low hanging fruits.

Yes, we should fix accounting across the board.  We are hugely in favor
of that.  But I don't buy that we'll erase that tail.  Fundamentally,
we don't know what the limit is, but we know that we need to save a
little "off the top".  I'm very much hoping we can find a way to
express that.
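
To put rough numbers on it (invented purely for illustration): on a
128GB machine, if average kernel-side usage is ~2GB but the 99th
percentile is ~6GB, sizing the top-level "do work here" memcg against
the average means any machine whose kernel usage runs above average is
genuinely overcommitted and the OOM policy misfires, while sizing it
against the 99th percentile strands ~4GB of schedulable memory on
every machine in the fleet.  An "off-the-top" reservation lets the
job-facing limit follow actual kernel usage instead of a static guess.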

As an aside: mucking about with extra nesting levels to achieve a
stable OOM semantic sounds doable, but it certainly sucks in a unified
hierarchy.  We'll end up with 1, 2, or 3 (or more in esoteric cases?
not sure) extra nesting levels for every other resource dimension.
And lawd help us if we ever need to do something similar in a
different resource dimension - the cross product is mind-bending.
What we do today with split hierarchies is essentially this, but on a
smaller scale.
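
To make the cross product concrete (a made-up layout, not our actual
config), stable OOM ordering alone might force something like:

        root
         `- reserve/                        <- off-the-top memory
         `- prio_hi/  prio_mid/  prio_lo/   <- levels only for OOM ordering
                `- job-a/  job-b/  ...

and if some other controller ever needs its own ordering levels, every
one of these memory levels has to be replicated under (or above) every
level of that other dimension in a unified hierarchy.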

> Another thing which *might* be relevant is the rigidity of the upper
> limit and the vagueness of soft limit of the current implementation.
> I have a rather strong suspicion that the way memcg config knobs
> behave now - one finicky, the other whatever - is likely hindering the
> use cases to fan out more naturally.  I could be completely wrong on
> this but your mention of inflexibility of absolute limits reminds me
> of the issue.
>
> Thanks.
>
> --
> tejun