linux-kernel - Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <eb764131-6d2f-c088-5481-99d605a67349@bytedance.com>
Date:   Mon, 31 Jul 2023 14:00:22 +0800
From:   Chuyi Zhou <zhouchuyi@...edance.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     hannes@...xchg.org, roman.gushchin@...ux.dev, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, bpf@...r.kernel.org,
        linux-kernel@...r.kernel.org, wuyun.abel@...edance.com,
        robin.lu@...edance.com, muchun.song@...ux.dev,
        zhengqi.arch@...edance.com
Subject: Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY

Hello, Michal

在 2023/7/28 01:23, Michal Hocko 写道:
> On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
>>
>>
>> 在 2023/7/27 16:15, Michal Hocko 写道:
>>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>>>> This patchset tries to add a new bpf prog type and use it to select
>>>> a victim memcg when global OOM is invoked. The mainly motivation is
>>>> the need to customizable OOM victim selection functionality so that
>>>> we can protect more important app from OOM killer.
>>>
>>> This is rather modest to give an idea how the whole thing is supposed to
>>> work. I have looked through patches very quickly but there is no overall
>>> design described anywhere either.
>>>
>>> Please could you give us a high level design description and reasoning
>>> why certain decisions have been made? e.g. why is this limited to the
>>> global oom sitation, why is the BPF program forced to operate on memcgs
>>> as entities etc...
>>> Also it would be very helpful to call out limitations of the BPF
>>> program, if there are any.
>>>
>>> Thanks!
>>
>> Hi,
>>
>> Thanks for your advice.
>>
>> The global/memcg OOM victim selection uses process as the base search
>> granularity. However, we can see a need for cgroup level protection and
>> there's been some discussion[1]. It seems reasonable to consider using memcg
>> as a search granularity in victim selection algorithm.
> 
> Yes, it can be reasonable for some policies but making it central to the
> design is very limiting.
> 
>> Besides, it seems pretty well fit for offloading policy decisions to a BPF
>> program, since BPF is scalable and flexible. That's why the new BPF
>> program operate on memcgs as entities.
> 
> I do not follow your line of argumentation here. The same could be
> argued for processes or beans.
> 
>> The idea is to let user choose which leaf in the memcg tree should be
>> selected as the victim. At the first layer, if we choose A, then it protects
>> the memcg under the B, C, and D subtrees.
>>
>>          root
>>       /   ｜ \  \
>>      A    B  C  D
>>     /\
>>    E F
>>
>>
>> Using the BPF prog, we are allowed to compare the OOM priority between
>> two siblings so that we can choose the best victim in each layer.
> 
> How is the priority defined and communicated to the userspace.
> 
>> For example:
>>
>> run_prog(B, C) -> choose B
>> run_prog(B, D) -> choose D
>> run_prog(A, D) -> choose A
>>
>> Once we select A as the victim in the first layer, the victim in next layer
>> would be selected among A's children. Finally, we select a leaf memcg as
>> victim.
> 
> This sounds like a very specific oom policy and that is fine. But the
> interface shouldn't be bound to any concepts like priorities let alone
> be bound to memcg based selection. Ideally the BPF program should get
> the oom_control as an input and either get a hook to kill process or if
> that is not possible then return an entity to kill (either process or
> set of processes).

Here are two interfaces I can think of. I was wondering if you could 
give me some feedback.

1. Add a new hook in select_bad_process(), we can attach it and return a 
set of pids or cgroup_ids which are pre-selected by user-defined policy, 
  suggested by Roman. Then we could use oom_evaluate_task to find a 
final victim among them. It's user-friendly and we can offload the OOM 
policy to userspace.

2. Add a new hook in oom_evaluate_task() and return a point to override 
the default oom_badness return-value. The simplest way to use this is to 
protect certain processes by setting the minimum score.

Of course if you have a better idea, please let me know.

Thanks!
---
Chuyi Zhou
> 
>> In our scenarios, the impact caused by global OOM's is much more common, so
>> we only considered global in this patchset. But it seems that the idea can
>> also be applied to memcg OOM.
> 
> The global and memcg OOMs shouldn't have a different interface. If a
> specific BPF program wants to implement a different policy for global
> vs. memcg OOM then be it but this should be a decision of the said
> program not an inherent limitation of the interface.
> 
>>
>> [1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>>
>> Thanks!
>> -- 
>> Chuyi Zhou
>