linux-kernel - Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <f8f44103-afba-10ee-b14b-a8e60a7f33d8@bytedance.com>
Date:   Tue, 1 Aug 2023 00:26:20 +0800
From:   Chuyi Zhou <zhouchuyi@...edance.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     hannes@...xchg.org, roman.gushchin@...ux.dev, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, bpf@...r.kernel.org,
        linux-kernel@...r.kernel.org, wuyun.abel@...edance.com,
        robin.lu@...edance.com, muchun.song@...ux.dev,
        zhengqi.arch@...edance.com
Subject: Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY



在 2023/7/31 21:23, Michal Hocko 写道:
> On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
>> Hello, Michal
>>
>> 在 2023/7/28 01:23, Michal Hocko 写道:
> [...]
>>> This sounds like a very specific oom policy and that is fine. But the
>>> interface shouldn't be bound to any concepts like priorities let alone
>>> be bound to memcg based selection. Ideally the BPF program should get
>>> the oom_control as an input and either get a hook to kill process or if
>>> that is not possible then return an entity to kill (either process or
>>> set of processes).
>>
>> Here are two interfaces I can think of. I was wondering if you could give me
>> some feedback.
>>
>> 1. Add a new hook in select_bad_process(), we can attach it and return a set
>> of pids or cgroup_ids which are pre-selected by user-defined policy,
>> suggested by Roman. Then we could use oom_evaluate_task to find a final
>> victim among them. It's user-friendly and we can offload the OOM policy to
>> userspace.
>>
>> 2. Add a new hook in oom_evaluate_task() and return a point to override the
>> default oom_badness return-value. The simplest way to use this is to protect
>> certain processes by setting the minimum score.
>>
>> Of course if you have a better idea, please let me know.
> 
> Hooking into oom_evaluate_task seems the least disruptive to the
> existing oom killer implementation. I would start by planing with that
> and see whether useful oom policies could be defined this way. I am not
> sure what is the best way to communicate user input so that a BPF prgram
> can consume it though. The interface should be generic enough that it
> doesn't really pre-define any specific class of policies. Maybe we can
> add something completely opaque to each memcg/task? Does BPF
> infrastructure allow anything like that already?
> 

“Maybe we can add something completely opaque to each memcg/task?”
Sorry, I don't understand what you mean.

I think we probably don't need to expose too much to the user, the 
following might be sufficient:

noinline int bpf_get_score(struct oom_control *oc,
		struct task_struct *task);

static int oom_evaluate_task()
{
	...
	points = bpf_get_score(oc, task);
	if (!check_points_valid(points))
          	points = oom_badness(task, oc->totalpages);
	...
}

There are several reasons:

1. The implementation of use-defined OOM policy, such as iteration, 
sorting and other complex operations, is more suitable to be placed in 
the userspace rather than in the bpf program. It is more convenient to 
implement these operations in userspace in which the useful information 
(memory usage of each task and memcg, memory allocation speed, etc.) can 
also be captured. For example, oomd implements multiple policies[1] 
without kernel-space input.

2. Userspace apps, such as oomd, can import useful information into bpf 
program, e.g., through bpf_map, and update it periodically. For example, 
we can do the scoring directly in userspace and maintain a score hash, 
so that in the bpf program, we only need to look for the corresponding 
score of the process.

Userspace policy（oomd）
          bpf_map_update
          score_hash
       ------------------>  BPF program
                               look up score in
                                score_hash
                             ---------------> kernel space
Just some thoughts.
Thanks!

[1]https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins)