linux-kernel - Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZOSQYrFb2xleB83o@dhcp22.suse.cz>
Date:   Tue, 22 Aug 2023 12:39:30 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Chuyi Zhou <zhouchuyi@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>, muchun.song@...ux.dev,
        bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
        wuyun.abel@...edance.com, robin.lu@...edance.com
Subject: Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
> >  static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  {
> >         struct oom_control *oc = arg;
> > @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> >                 goto next;
> >
> > +       /*
> > +        * If task is allocating a lot of memory and has been marked to be
> > +        * killed first if it triggers an oom, then select it.
> > +        */
> > +       if (oom_task_origin(task)) {
> > +               points = LONG_MAX;
> > +               goto select;
> > +       }
> > +
> > +       switch (bpf_oom_evaluate_task(task, oc)) {
> > +       case BPF_EVAL_ABORT:
> > +               goto abort; /* abort search process */
> > +       case BPF_EVAL_NEXT:
> > +               goto next; /* ignore the task */
> > +       case BPF_EVAL_SELECT:
> > +               goto select; /* select the task */
> > +       default:
> > +               break; /* No BPF policy */
> > +       }
> > +
> 
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible but I am worried this will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all they should make sure that the
victim is belonging to the oom domain. Arguably it should be also aware
of ongoing oom victim tear down to prevent from overkilling. Proper oom
victim reference counting handling. Most people are not even aware of
those things. Do we really want all those to be re-invented - most
likely incorrectly?

Advantage of reusing oom_evaluate_task is that all that can be avoided.
Iterating over tasks with a pre-defined oom-victim sure sounds like
unnecessary and wasted CPU cycles but if you want to prevent
over-killing this is still necessary. As the oom killer can be invoked
really rapidly (before the victim has a chance to die) I believe this is
a very useful feature.
-- 
Michal Hocko
SUSE Labs