Message-ID: <CAAAKZwtApj-FgRc2V77nEb3BUd97Rwhgf-b-k0zhf1u+Y4fqxA@mail.gmail.com>
Date: Wed, 27 Sep 2017 11:11:42 -0700
From: Tim Hockin <thockin@...kin.org>
To: Roman Gushchin <guro@...com>
Cc: Michal Hocko <mhocko@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Tejun Heo <tj@...nel.org>, kernel-team@...com,
David Rientjes <rientjes@...gle.com>, linux-mm@...ck.org,
Vladimir Davydov <vdavydov.dev@...il.com>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
Andrew Morton <akpm@...ux-foundation.org>,
Cgroups <cgroups@...r.kernel.org>, linux-doc@...r.kernel.org,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin <guro@...com> wrote:
> On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote:
>> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mhocko@...nel.org> wrote:
>> > On Tue 26-09-17 20:37:37, Tim Hockin wrote:
>> > [...]
>> >> I feel like David has offered examples here, and many of us at Google
>> >> have offered examples as long ago as 2013 (if I recall) of cases where
>> >> the proposed heuristic is EXACTLY WRONG.
>> >
>> > I do not think we have discussed anything resembling the current
>> > approach. And I would really appreciate some more examples where
>> > decisions based on leaf nodes would be EXACTLY WRONG.
>> >
>> >> We need OOM behavior to kill in a deterministic order configured by
>> >> policy.
>> >
>> > And nobody is objecting to this use case. I think we can build a priority
>> > policy on top of a leaf-based decision as well. The main point we are
>> > trying to sort out here is a reasonable semantic that would work for
>> > most workloads. Sibling-based selection will simply not work for those
>> > that have to use deeper hierarchies for organizational purposes. I
>> > haven't heard a counter-argument to that example yet.
>>
>
> Hi, Tim!
>
>> We have a priority-based, multi-user cluster. That cluster runs a
>> variety of work, including critical things like search and gmail, as
>> well as non-critical things like batch work. We try to offer our
>> users an SLA around how often they will be killed by factors outside
>> their control, but we also want to get higher utilization. We know for a
>> fact (data, lots of data) that most jobs have spare memory capacity,
>> set aside for spikes or simply because accurate sizing is hard. We
>> can sell "guaranteed" resources to critical jobs, with a high SLA. We
>> can sell "best effort" resources to non-critical jobs with a low SLA.
>> We achieve much better overall utilization this way.
>
> This is well understood.
>
>>
>> I need to represent the priority of these tasks in a way that gives me
>> a very strong promise that, in case of system OOM, the non-critical
>> jobs will be chosen before the critical jobs. Regardless of size.
>> Regardless of how many non-critical jobs have to die. I'd rather kill
>> *all* of the non-critical jobs than a single critical job. Size of
>> the process or cgroup is simply not a factor, and honestly given 2
>> options of equal priority I'd say age matters more than size.
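
To make that rule concrete, here is a minimal sketch in C of the
comparison it implies, assuming a hypothetical per-cgroup priority knob
and a recorded start time (every name below is illustrative, not part
of this patchset):

/* Illustrative only: "priority first, then age, never size". */
struct victim_candidate {
	int oom_priority;		/* higher means more killable */
	unsigned long start_time;	/* when the job started */
	unsigned long mem_usage;	/* deliberately never consulted */
};

/* Return the candidate that should be killed first. */
static struct victim_candidate *
pick_victim(struct victim_candidate *a, struct victim_candidate *b)
{
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority > b->oom_priority ? a : b;
	/* Equal priority: prefer the younger job over the bigger one. */
	return a->start_time > b->start_time ? a : b;
}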
>>
>> So concretely I have 2 first-level cgroups, one for "guaranteed" and
>> one for "best effort" classes. I always want to kill from "best
>> effort", even if that means killing 100 small cgroups, before touching
>> "guaranteed".
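
In cgroup v2 terms, that policy could be wired up as in this userspace
sketch, again assuming a hypothetical memory.oom_priority interface
file (the path and file name are made up for illustration):

#include <stdio.h>

/* Write a priority value into a top-level cgroup's (hypothetical) knob. */
static void set_oom_priority(const char *cgroup, int prio)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/%s/memory.oom_priority", cgroup);
	f = fopen(path, "w");
	if (!f)
		return;
	fprintf(f, "%d\n", prio);
	fclose(f);
}

int main(void)
{
	set_oom_priority("guaranteed", 0);	/* victims of last resort */
	set_oom_priority("best-effort", 100);	/* always chosen first */
	return 0;
}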
>>
>> I apologize if this is not as thorough as the rest of the thread - I
>> am somewhat out of touch with the guts of it all these days. I just
>> feel compelled to indicate that, as a historical user (via Google
>> systems) and current user (via Kubernetes), some of the assertions
>> being made here do not ring true for our very real use cases. I
>> desperately want cgroup-aware OOM handling, but it has to be
>> policy-based or it is just not useful to us.
>
> A policy-based approach was suggested by Michal at the very beginning of
> this discussion. Although nobody had any strong objections to it,
> we agreed that it is outside the scope of this patchset.
>
> The idea of this patchset is to introduce the ability to select a memcg
> as an OOM victim, with the optional killing of all tasks belonging to it.
> I believe this is absolutely mandatory for _any_ further development
> of the OOM killer that wants to deal with memory cgroups as OOM entities.
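
As a rough sketch of that second step (not the actual patch), the
"kill all tasks belonging to the victim memcg" part can be built on
the existing mem_cgroup_scan_tasks() helper, which walks every task
in a memcg's subtree; the function names here are illustrative:

/* Kernel-context sketch; error handling and locking omitted. */
static int kill_memcg_member(struct task_struct *task, void *unused)
{
	/* Skip tasks that are already dying as OOM victims. */
	if (!tsk_is_oom_victim(task))
		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, task, true);
	return 0;
}

static void kill_memcg_tasks(struct mem_cgroup *victim)
{
	/* One call covers the victim memcg and all its descendants. */
	mem_cgroup_scan_tasks(victim, kill_memcg_member, NULL);
}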
>
> If you think this makes it impossible to support some use cases in the
> future, let's discuss it. Otherwise, I'd prefer to finish this part of
> the work and proceed to the following improvements on top of it.
>
> Thank you!
I am 100% in favor of killing whole groups. We want that too. I just
needed to express disagreement with statements that size-based
decisions could not produce bad results. They can and do.