Message-ID: <1719817f-6ae9-8f0b-5075-657cb4e80e20@oracle.com>
Date:   Fri, 4 Aug 2023 12:34:43 +0100
From:   Alan Maguire <alan.maguire@...cle.com>
To:     Chuyi Zhou <zhouchuyi@...edance.com>, hannes@...xchg.org,
        mhocko@...nel.org, roman.gushchin@...ux.dev, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, muchun.song@...ux.dev
Cc:     bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
        wuyun.abel@...edance.com, robin.lu@...edance.com
Subject: Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select_task

On 04/08/2023 10:38, Chuyi Zhou wrote:
> This patch adds a new hook bpf_select_task in oom_evaluate_task. It

bpf_select_task() feels like too generic a name - bpf_oom_select_task()
might make the context clearer.

I'd also suggest adding a documentation patch for a new
Documentation/bpf/oom.rst or whatever to describe how it is all supposed
to work.

> takes oc and current iterating task as parameters and returns a result
> indicating which one is selected by bpf program.
> 
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules that should be obeyed. Specifically, we skip
> "unkillable" tasks (e.g., kthread, MMF_OOM_SKIP, in_vfork()). So we do not
> consider tasks with the lowest score returned by oom_badness unless it was
> caused by OOM_SCORE_ADJ_MIN.
> 
> If we attach a prog to the hook, the interface is enabled only when we have
> successfully chosen at least one valid candidate in a previous iteration. This
> is to avoid finding nothing if the bpf program rejects all tasks.
> 

I don't know anything about OOM mechanisms, so maybe it's just me, but I
found this confusing. Relying on the previous iteration to control
current iteration behaviour seems risky - even if BPF found a victim in
iteration N, there's no guarantee it will in iteration N+1.

Naively I would have thought the right answer here would be to honour
the choice OOM would have made (in the absence of BPF execution) for
cases where BPF did not select a victim. Is that sort of scheme
workable? Does that make sense from the mm side, or would we actually
want to fall back to

	pr_warn("Out of memory and no killable processes...\n");

...if BPF didn't select a process?

The danger here is that the current non-BPF mechanism seems to be
guaranteed to find a victim, but delegating to BPF is not. So what is
the right behaviour for such cases from the mm perspective?

(One thing that would probably be worth doing from the BPF side would be
to add a tracepoint to mark the scenario where nothing was chosen for
OOM kill via BPF; this would allow BPF programs to catch the fact that
their OOM selection mechanisms didn't work.)
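
To make the fallback idea concrete, here is a rough userspace sketch in
plain C (not kernel code, and not what the patch does). Everything in it is
made up for illustration: struct task, policy_fn, the two example policies
and select_victim() are hypothetical names, and the policy callback merely
stands in for an attached BPF program. It assumes the default heuristic is
"highest badness wins" and that a score of LONG_MIN marks an unkillable task.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's struct task_struct. */
struct task {
	int pid;
	long badness;	/* models the oom_badness() score; LONG_MIN = unkillable */
};

/* Hypothetical callback standing in for an attached BPF program. */
typedef bool (*policy_fn)(const struct task *t);

struct selection {
	const struct task *victim;	/* NULL only if nothing is killable */
	bool policy_chose;		/* false: the default choice was used */
};

/* Example policies, for illustration only. */
bool policy_reject_all(const struct task *t)
{
	(void)t;
	return false;
}

bool policy_pid_at_least_4(const struct task *t)
{
	return t->pid >= 4;
}

/*
 * Track two candidates in one pass: the one the default heuristic would
 * pick, and the best one the policy accepted. Prefer the policy's pick,
 * and honour the default choice when the policy accepted nothing.
 */
struct selection select_victim(const struct task *tasks, size_t n,
			       policy_fn policy)
{
	const struct task *fallback = NULL;
	const struct task *picked = NULL;
	struct selection sel = { NULL, false };

	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct task *t = &tasks[i];

		if (t->badness == LONG_MIN)
			continue;	/* unkillable: never offered to the policy */
		if (!fallback || t->badness > fallback->badness)
			fallback = t;
		if (policy && policy(t) &&
		    (!picked || t->badness > picked->badness))
			picked = t;
	}

	if (picked) {
		sel.victim = picked;
		sel.policy_chose = true;
	} else {
		/* A tracepoint for "policy chose nothing" could fire here. */
		sel.victim = fallback;
	}
	return sel;
}
```

With policy_reject_all the result is whatever the default heuristic would
have picked, and policy_chose == false marks the case a tracepoint could
report. victim is NULL only when every task is unkillable, which would
correspond to the pr_warn() path above.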

Alan

> Signed-off-by: Chuyi Zhou <zhouchuyi@...edance.com>
> ---
>  mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 50 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..aec4c55ed49a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -18,6 +18,7 @@
>   *  kernel subsystems and hints as to where to find out what things do.
>   */
>  
> +#include <linux/bpf.h>
>  #include <linux/oom.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> @@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
>  	if (!p)
>  		return LONG_MIN;
>  
> +	/*
> +	 * If task is allocating a lot of memory and has been marked to be
> +	 * killed first if it triggers an oom, then set points to LONG_MAX.
> +	 * It will be selected unless we keep oc->chosen through bpf interface.
> +	 */
> +	if (oom_task_origin(p)) {
> +		task_unlock(p);
> +		return LONG_MAX;
> +	}
> +
>  	/*
>  	 * Do not even consider tasks which are explicitly marked oom
>  	 * unkillable or have been already oom reaped or the are in
> @@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	return CONSTRAINT_NONE;
>  }
>  
> +enum bpf_select_ret {
> +	BPF_SELECT_DISABLE,
> +	BPF_SELECT_TASK,
> +	BPF_SELECT_CHOSEN,
> +};
> +
> +__weak noinline int bpf_select_task(struct oom_control *oc,
> +				struct task_struct *task, long badness_points)
> +{
> +	return BPF_SELECT_DISABLE;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_select_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)
> +
> +static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &oom_bpf_fmodret_ids,
> +};
> +
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
> +	enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
>  	struct oom_control *oc = arg;
>  	long points;
>  
> @@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto abort;
>  	}
>  
> +	points = oom_badness(task, oc->totalpages);
> +
>  	/*
> -	 * If task is allocating a lot of memory and has been marked to be
> -	 * killed first if it triggers an oom, then select it.
> +	 * Do not consider tasks with lowest score value except it was caused
> +	 * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
> +	 * bpf interface.
>  	 */
> -	if (oom_task_origin(task)) {
> -		points = LONG_MAX;
> +	if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +		goto next;
> +
> +	if (oc->chosen)
> +		bpf_ret = bpf_select_task(oc, task, points);
> +
> +	if (bpf_ret == BPF_SELECT_TASK)
>  		goto select;
> -	}
>  
> -	points = oom_badness(task, oc->totalpages);
> -	if (points == LONG_MIN || points < oc->chosen_points)
> +	if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
>  		goto next;
>  
>  select:
> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>  
>  static int __init oom_init(void)
>  {
> +	int err;
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>  #ifdef CONFIG_SYSCTL
>  	register_sysctl_init("vm", vm_oom_kill_table);
>  #endif

Probably worth having an #ifdef CONFIG_BPF or similar here.

> +	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
> +	if (err)
> +		pr_warn("error while registering oom fmodret entrypoints: %d", err);
>  	return 0;
>  }
>  subsys_initcall(oom_init)