linux-kernel - Re: [PATCH 14/23] userfaultfd: wake pending userfaults

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20151022132015.GF19147@redhat.com>
Date:	Thu, 22 Oct 2015 15:20:15 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	qemu-devel@...gnu.org, kvm@...r.kernel.org,
	linux-api@...r.kernel.org, Pavel Emelyanov <xemul@...allels.com>,
	Sanidhya Kashyap <sanidhya.gatech@...il.com>,
	zhang.zhanghailiang@...wei.com,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	Hugh Dickins <hughd@...gle.com>,
	Peter Feiner <pfeiner@...gle.com>,
	"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	"Huangpeng (Peter)" <peter.huangpeng@...wei.com>
Subject: Re: [PATCH 14/23] userfaultfd: wake pending userfaults

On Thu, Oct 22, 2015 at 02:10:56PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> > @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> >  	 * through poll/read().
> >  	 */
> >  	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> > -	for (;;) {
> > -		set_current_state(TASK_KILLABLE);
> > -		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> > -		    fatal_signal_pending(current))
> > -			break;
> > -		spin_unlock(&ctx->fault_wqh.lock);
> > +	set_current_state(TASK_KILLABLE);
> > +	spin_unlock(&ctx->fault_wqh.lock);
> >  
> > +	if (likely(!ACCESS_ONCE(ctx->released) &&
> > +		   !fatal_signal_pending(current))) {
> >  		wake_up_poll(&ctx->fd_wqh, POLLIN);
> >  		schedule();
> > +		ret |= VM_FAULT_MAJOR;
> > +	}
> 
> So what happens here if schedule() spontaneously wakes for no reason?
> 
> I'm not sure enough of userfaultfd semantics to say if that would be
> bad, but the code looks suspiciously like it relies on schedule() not to
> do that; which is wrong.

That would repeat the fault and trigger the DEBUG_VM printk above,
complaining that FAULT_FLAG_ALLOW_RETRY is not set. It is only a
problem for kernel faults (copy_user/get_user_pages*). Userland won't
error out in such a way because userland would return 0 and not
VM_FAULT_RETRY. So it's only required when the task schedule in
TASK_KILLABLE state.

If schedule spontaneously wakes up a task in TASK_KILLABLE state that
would be a bug in the scheduler in my view. Luckily there doesn't seem
to be such a bug, or at least we never experienced it.

Overall this dependency on the scheduler will be lifted soon, as it
must be lifted in order to track the write protect faults, so longer
term this is not a concern and this is not a design issue, but this
remains an implementation detail that avoided to change the arch code
and gup.

If you send a reproducer to show how the current scheduler can wake up
the task in TASK_KILLABLE despite not receiving a wakeup, that would
help too as we never experienced that.

The reason this dependency will be lifted soon is that the userfaultfd
write protect tracking may be armed at any time while the app is
running so we may already be in the middle of a page fault that
returned VM_FAULT_RETRY by the time we arm the write protect
tracking. So longer term the arch page fault and
__get_user_pages_locked must allow handle_userfault() to return
VM_FAULT_RETRY even if FAULT_FLAG_TRIED is set. Then we don't care if
the task is waken.

Waking a task in TASK_KILLABLE state it will still waste CPU so the
scheduler still shouldn't do that. All load balancing works better if
the task isn't running anyway so I can't imagine a good reason for
wanting to run a task in TASK_KILLABLE state before it gets the
wakeup.

Trying to predict that a wakeup is always happening in less time than
it takes to schedule the task out of the CPU, sounds a very CPU
intensive thing to measure and it's probably better to leave those
heuristics in the caller like by spinning on a lock for a while before
blocking.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/