linux-kernel - Re: [PATCH v3] mm/oom: do not oom reap task with an unresolved robust futex

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <118fc685-c68d-614f-006a-7d5487302122@redhat.com>
Date:   Mon, 14 Feb 2022 15:39:31 -0500
From:   Nico Pache <npache@...hat.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Waiman Long <longman@...hat.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
        jsavitz@...hat.com, peterz@...radead.org, tglx@...utronix.de,
        mingo@...hat.com, dvhart@...radead.org, dave@...olabs.net,
        andrealmeid@...labora.com
Subject: Re: [PATCH v3] mm/oom: do not oom reap task with an unresolved robust
 futex



On 1/18/22 03:51, Michal Hocko wrote:
> On Mon 17-01-22 17:56:28, Nico Pache wrote:
>> On 1/17/22 11:05, Waiman Long wrote:
>>> On 1/17/22 03:52, Michal Hocko wrote:
> [...]
>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>>>           return;
>>>>>   +#ifdef CONFIG_FUTEX
>>>>> +    /*
>>>>> +     * If the ooming task's SIGKILL has not finished handling the
>>>>> +     * robust futex it is not correct to reap the mm concurrently.
>>>>> +     * Do not wake the oom reaper when the task still contains a
>>>>> +     * robust list.
>>>>> +     */
>>>>> +    if (tsk->robust_list)
>>>>> +        return;
>>>>> +#ifdef CONFIG_COMPAT
>>>>> +    if (tsk->compat_robust_list)
>>>>> +        return;
>>>>> +#endif
>>>>> +#endif
>>>> If this turns out to be really needed, which I do not really see at the
>>>> moment, then this is not the right way to handle this situation. The oom
>>>> victim could get stuck and the oom killer wouldn't be able to move
>>>> forward. If anything the victim would need to get MMF_OOM_SKIP set.
>>
>> I will try this, but I don't immediately see any difference between this return
>> case and setting the bit, passing the oom_reaper_list, then skipping it based on
>> the flag. Do you mind explaining how this could lead to the oom killer getting
>> stuck?
> 
> The primary purpose of the oom_reaper is to guarantee a forward
> progress. If a task gets stuck in the kernel - e.g. on locks then it
> won't bail out and won't handle signals (i.e. SIGKILL from the
> userspace). The oom killer prevents new oom victims selection in a
> presence of an existing oom victim (see oom_evaluate_task). That means
> that we not only send a SIGKILL to the victim, we also wake up the oom
> reaper which then asynchronously tears down the private memory of the
> task (thus release at least some of its memory) and once it is done it
> will set MMF_OOM_SKIP flag which will tell the oom killer
> (oom_evaluate_task) that this victim is no longer interesting and a new
> victim can be selected.
> 
> Makes sense?
Thank does, Thank you for clearing that up!
> 
> Part of the async tear down is also MMF_UNSTABLE handling (see
> __oom_reap_task_mm) which tells #PF handling code
> (check_stable_address_space) that the underlying memory could have been
> tempered with and thus it should return SIGBUS. The underlying
> assumption is that the process (and all tasks which share its mm) has
> been killed and it will never return to the userspace so the de-facto
> memory corruption doesn't matter.
> 
> One thing that is still unclear to me is why this leads to any locked up
> tasks. Looking at exit_robust_list I can see that it is accessing the
> userspace memory but this should return EFAULT in this situation. My
> assumption (which might be really wrong) is that futex shared among
> processes which are not sharing mm nor signal handling will be sitting
> in a shared memory. 
> 
> Now to the actual fix. I do not think we want to hide the task from the
> oom reaper as you are suggesting. Futexes are very likely to be used for
> many processes and that would make the whole async scheme useless. We
> need something like the below.
> 
> futex_exit_release is not directly usable as it implicitly depends
> on memory allocation (#PF) and that is not acceptable. So instead we
> need something futex_exit_try_release or similar which would fail the
> operation in case get_user (pagefault_disable) needs to really handle
> the #PF or if the futex_exit_mutex is locked. In other words this would
> have to be a completely non-blocking operation. The oom reaper would
> then bail out.
Yeah Joel came up with a similar idea once we realized his v2 had the issue with
sleeping.

We've recently been discussing the following if statement in __oom_reap_task_mm:
	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))

Given the comment above it, and some of the upstream discussion the original
RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
only want to reap anon memory and reaping shared memory can be dangerous is this
statement incorrect?

We have a patch queued up to make this change, but wanted to get your opinion on
why this was originally designed this way in case we are missing something.

-- Nico

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef5860fc7d22..57660d3d1b79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -613,6 +613,9 @@ static void oom_reap_task(struct task_struct *tsk)
>  	int attempts = 0;
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>  
> +	if (futex_exit_try_release(tsk))
> +		goto fail;
> +
>  	/* Retry the mmap_read_trylock(mm) a few times */
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
> @@ -621,6 +624,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  	    test_bit(MMF_OOM_SKIP, &mm->flags))
>  		goto done;
>  
> +fail:
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
>  	sched_show_task(tsk);
> @@ -1184,6 +1188,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	if (!reap)
>  		goto drop_mm;
>  
> +	if (futex_exit_try_release(tsk)) {
> +		ret = -EAGAIN;
> +		goto drop_mm;
> +	}
> +	
>  	if (mmap_read_lock_killable(mm)) {
>  		ret = -EINTR;
>  		goto drop_mm;