linux-kernel - Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZNCsIm+RK0LStUA6@dhcp22.suse.cz>
Date:   Mon, 7 Aug 2023 10:32:34 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Chuyi Zhou <zhouchuyi@...edance.com>
Cc:     Alan Maguire <alan.maguire@...cle.com>, hannes@...xchg.org,
        roman.gushchin@...ux.dev, ast@...nel.org, daniel@...earbox.net,
        andrii@...nel.org, muchun.song@...ux.dev, bpf@...r.kernel.org,
        linux-kernel@...r.kernel.org, wuyun.abel@...edance.com,
        robin.lu@...edance.com
Subject: Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select_task

On Sat 05-08-23 07:55:56, Chuyi Zhou wrote:
> Hello,
> 
> 在 2023/8/4 19:34, Alan Maguire 写道:
[...]
> > I don't know anything about OOM mechanisms, so maybe it's just me, but I
> > found this confusing. Relying on the previous iteration to control
> > current iteration behaviour seems risky - even if BPF found a victim in
> > iteration N, it's no guarantee it will in iteration N+1.
> > 
> The current kernel's OOM actually works like this:
> 
> 1. if we first find a valid candidate victim A in iteration N, we would
> record it in oc->chosen.
> 
> 2. In iteration N + 1, N+2..., we just compare oc->chosen with the current
> iterating task. Suppose we think current task B is better than
> oc->chosen(A), we would set oc->chosen = B and we would not consider A
> anymore.
> 
> IIUC, most policy works like this. We just need to find the *most* suitable
> victim. Normally, if in current iteration we drop A and select B, we would
> not consider A anymore.

Yes, we iterate over all tasks in the specific oom domain (all tasks for
global and all members of memcg tree for hard limit oom). The in-tree
oom policy has to iterate all tasks to achieve some of its goals (like
preventing overkilling while the previously selected victim is still on
the way out). Also oom_score_adj might change the final decision so you
have to really check all eligible tasks.

I can imagine a BPF based policy could be less constrained and as Roman
suggested have a pre-selected victims on stand by. I do not see problem
to have break like mode. Similar to current abort without a canceling an
already noted victim.
-- 
Michal Hocko
SUSE Labs