Date:	Fri, 17 Jun 2016 15:29:04 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc:	linux-mm@...ck.org, rientjes@...gle.com, oleg@...hat.com,
	vdavydov@...allels.com, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem

On Fri 17-06-16 22:12:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 17-06-16 20:38:01, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > > Anyway, would you be OK with the patch if I added the current->mm check
> > > > > > and resolve its necessity in a separate patch?
> > > > > 
> > > > > Please correct task_will_free_mem() in oom_kill_process() as well.
> > > > 
> > > > We cannot hold task_lock over all of task_will_free_mem, and I am not
> > > > even sure we have to develop an elaborate way to make it raceless just
> > > > for the nommu case. The current case is simple: we cannot race here.
> > > > Is that sufficient for you?
> > > 
> > > We can use find_lock_task_mm() inside mark_oom_victim().
> > > That is, call wake_oom_reaper() from mark_oom_victim() like
> > > 
> > > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper)
> > > {
> > > 	WARN_ON(oom_killer_disabled);
> > > 	/* OOM killer might race with memcg OOM */
> > > 	tsk = find_lock_task_mm(tsk);
> > > 	if (!tsk)
> > > 		return;
> > > 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
> > > 		task_unlock(tsk);
> > > 		return;
> > > 	}
> > > 	task_unlock(tsk);
> > > 	atomic_inc(&tsk->signal->oom_victims);
> > > 	/*
> > > 	 * Make sure that the task is woken up from uninterruptible sleep
> > > 	 * if it is frozen because OOM killer wouldn't be able to free
> > > 	 * any memory and livelock. freezing_slow_path will tell the freezer
> > > 	 * that TIF_MEMDIE tasks should be ignored.
> > > 	 */
> > > 	__thaw_task(tsk);
> > > 	atomic_inc(&oom_victims);
> > > 	if (can_use_oom_reaper)
> > > 		wake_oom_reaper(tsk);
> > > }
> > > 
> > > and, on the normal path, move mark_oom_victim() to after task_unlock(victim):
> > > 
> > >  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > > -	mark_oom_victim(victim);
> > > 
> > > -	if (can_oom_reap)
> > > -		wake_oom_reaper(victim);
> > > +	wake_oom_reaper(victim, can_oom_reap);
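> > > 
> > > where wake_oom_reaper() would take the reapability as an argument, along
> > > the lines of this hypothetical sketch (modeled on the current queuing
> > > implementation, not actual code):
> > > 
> > > static void wake_oom_reaper(struct task_struct *tsk, bool reapable)
> > > {
> > > 	/*
> > > 	 * Hypothetical: forward the (racy) reapability decision made in
> > > 	 * oom_kill_process() instead of deciding at the call site.
> > > 	 */
> > > 	if (!reapable)
> > > 		return;
> > > 	if (!oom_reaper_th || tsk->oom_reaper_list)
> > > 		return;
> > > 	get_task_struct(tsk);
> > > 	spin_lock(&oom_reaper_lock);
> > > 	tsk->oom_reaper_list = oom_reaper_list;
> > > 	oom_reaper_list = tsk;
> > > 	spin_unlock(&oom_reaper_lock);
> > > 	wake_up(&oom_reaper_wait);
> > > }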
> > 
> > I do not like this because then we would have to check the reapability
> > from inside the oom_reaper again.
> 
> I don't understand why you think so. But strictly speaking, the can_oom_reap
> calculation in oom_kill_process() is always racy, and [PATCH 10/10] is not safe.
> 
>   CPU0 (memory allocating task)       CPU1 (kthread)                    CPU2 (OOM victim)
> 
>                                       Calls use_mm(victim->mm).
>                                       Starts some worker.
>   Enters out_of_memory().
>   Enters oom_kill_process().
>                                       Finishes some worker.
>   Calls rcu_read_lock().
>   Sets can_oom_reap = false due to process_shares_mm() && !same_thread_group() && (p->flags & PF_KTHREAD).
>                                       Calls unuse_mm(victim->mm).
>   Continues scanning other processes.
>                                       Calls mmput(victim->mm).
>   Sends SIGKILL to victim.
>   Calls rcu_read_unlock().
>   Leaves oom_kill_process().
>                                                                         Calls do_exit().
>   Leaves out_of_memory().
>                                                                         Sets victim->mm = NULL from exit_mm().
>                                                                         Calls mmput() from exit_mm().
>                                                                         __mmput() is called because victim was the last user.
>   Enters out_of_memory().
>   oom_scan_process_thread() returns OOM_SCAN_ABORT.
>   Leaves out_of_memory().
>                                                                         __mmput() stalls but the oom_reaper is not called.
> 
> For correctness, can_oom_reap needs to be calculated inside the oom_reaper.
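> 
> For reference, the calculation in question looks roughly like this in the
> current oom_kill_process() (paraphrased; p is the loop variable and mm is
> the victim's mm):
> 
> 	rcu_read_lock();
> 	for_each_process(p) {
> 		if (!process_shares_mm(p, mm))
> 			continue;
> 		if (same_thread_group(p, victim))
> 			continue;
> 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> 			/*
> 			 * Unkillable user of this mm (e.g. a kthread which
> 			 * did use_mm()): reaping the memory would be unsafe.
> 			 */
> 			can_oom_reap = false;
> 			continue;
> 		}
> 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 	}
> 	rcu_read_unlock();
> 
> The scan only observes each p->mm at this one instant, which is what the
> timeline above exploits.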

Why would it be any less racy than the above? It doesn't employ any
serialization with use_mm users, nor does it serialize with the exit path.
The timing would be different, but not in a way that affects correctness.
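
To illustrate the use_mm side of it: a kthread adopts the victim's mm only
transiently, and nothing in the OOM path synchronizes with that window.
A minimal sketch (example_worker is a made-up name; use_mm()/unuse_mm()
are the existing helpers from mm/mmu_context.c):

	static void example_worker(struct mm_struct *mm)
	{
		/* mm was pinned by the caller via get_task_mm() */
		use_mm(mm);
		/*
		 * Window: p->mm is the victim's mm and p->flags has
		 * PF_KTHREAD, so the scan in oom_kill_process() would
		 * clear can_oom_reap if it runs right now.
		 */
		/* ... access the user address space ... */
		unuse_mm(mm);	/* p->mm is NULL again */
		mmput(mm);	/* may drop the last reference, as in the
				 * timeline above */
	}

Whether the scan sees the kthread inside or outside that window is pure
timing, with the current code and with the proposed one alike.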

> > But let me ask again. Does this really matter so much just because of
> > nommu, where we can fall into different traps? Can we simply focus on mmu
> > (aka the vast majority of cases), make it work reliably, and see what we
> > can do with nommu later?
> 
> To me, a timeout-based approach is sufficient for handling any traps that
> hit nommu kernels after the OOM killer is invoked.
> 
> Anyway, I don't like this series because it ignores theoretical cases.

I am pretty sure you would end up in the land of surprises and new
classes of races with any timeout-based solution as well. But we've
been through that discussion already.

> I can't make progress as long as you repeat "does it really matter/occur".
> Please go ahead without Reviewed-by: or Acked-by: from me.

Fair enough. I appreciate your review, which has caught many real bugs
and subtle issues!

I will repost the series on Monday.

-- 
Michal Hocko
SUSE Labs
