linux-kernel - Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections

Message-Id: <20170930171515.GK3521@linux.vnet.ibm.com>
Date:   Sat, 30 Sep 2017 10:15:15 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Boqun Feng <boqun.feng@...il.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
        Radim Krčmář <rkrcmar@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical
 sections

On Sat, Sep 30, 2017 at 07:41:56AM +0800, Boqun Feng wrote:
> On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> > > On 29/09/2017 13:01, Boqun Feng wrote:
> > > > Sasha Levin reported a WARNING:
> > > > 
> > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > ...
> > > > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > | 1.10.1-1ubuntu1 04/01/2014
> > > > | Call Trace:
> > > > ...
> > > > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > > > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > > > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > > > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > > > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > > > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > > > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > > > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > > > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > > > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > > > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > > > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > > > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > > > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > > > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > > > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > > > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > > > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > > > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > > > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > > > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > > > | get_link fs/namei.c:1047 [inline]
> > > > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > > > ...
> > > > 
> > > > And this happened when we hit a page fault in an RCU read-side critical
> > > > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > > > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > > > and be treated as a sleep in RCU read-side critical section, which is
> > > > not allowed (even in preemptible RCU).
> > > 
> > > Just a small fix to the commit message:
> > > 
> > > This happened when the host hit a page fault, and delivered it as an
> > > async page fault, while the guest was in an RCU read-side critical
> > > section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> > > but rcu_preempt_note_context_switch() would treat the reschedule as a
> > > sleep in RCU read-side critical section, which is not allowed (even in
> > > preemptible RCU).  Thus the WARN.
> > > 
> > > Queued with that change, thanks.
> > 
> > Not to be repetitive, but if the schedule() is on the guest, this change
> > really does silently break up an RCU read-side critical section on
> > guests built with PREEMPT=n.  (Yes, they were already being broken,
> > but it would be good to avoid this breakage in PREEMPT=n as well as
> > in PREEMPT=y.)
> > 
> 
> Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reasons we
> choose the halt path? Like:
> 
> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> 		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();
> 
> 
> But I think async PF could also happen while a user program is running?
> Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
> like:
> 
> 	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> 
> and the halt condition becomes:
> 
> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();
> 
> Thoughts?

This looks to me like it would cover it.  If a !PREEMPT guest takes the
async PF while running in the kernel, we halt, which would prevent the
sleep.
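
To make the cases easier to check, here is a minimal userspace sketch of
the proposed halt condition.  It is not the kernel code: struct apf_ctx
and apf_should_halt() are hypothetical names, and the fields merely stand
in for is_idle_task(current), preempt_count(), IS_ENABLED(CONFIG_PREEMPT),
user_mode(regs), and rcu_preempt_depth() from the snippet quoted above.

```c
#include <stdbool.h>

/*
 * Hypothetical snapshot of the state consulted when an async PF arrives.
 * In the kernel these values would come from is_idle_task(current),
 * preempt_count(), IS_ENABLED(CONFIG_PREEMPT), user_mode(regs) and
 * rcu_preempt_depth(); here they are plain fields so the logic can be
 * exercised in userspace.
 */
struct apf_ctx {
	bool idle_task;		/* faulting task is the idle task */
	int preempt_count;	/* preempt count at the time of the fault */
	bool config_preempt;	/* kernel built with PREEMPT=y */
	bool user;		/* fault interrupted user mode */
	int rcu_preempt_depth;	/* RCU read-side nesting (0 if !PREEMPT) */
};

/* Returns true if the guest should halt rather than call schedule(). */
static bool apf_should_halt(const struct apf_ctx *c)
{
	return c->idle_task || c->preempt_count > 1 ||
	       (!c->config_preempt && !c->user) || c->rcu_preempt_depth;
}
```

With this condition, a !PREEMPT kernel only takes the schedule() path
when the fault interrupted user mode, where no RCU read-side critical
section can be in progress.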

I take it that we get unhalted when the host gets things patched up?

> A side thing is being broken already for PREEMPT=n means we maybe fail
> to detect this in rcutorture? Then should we add a config with
> KVM_GUEST=y and try to run some memory consuming things(e.g. stress
> --vm) in the rcutorture kvm script simultaneously? Paolo, do you have
> any test workload that could trigger async PF quickly?

I do not believe that I have seen this in rcutorture, but I always run in
a guest OS on a large-memory system (well, by my old-fashioned standards,
anyway) that would be quite unlikely to evict a guest OS's pages.  Plus
I tend to run on shared systems, and deliberately running them out of
memory would not be particularly friendly to others using those systems.

I -do- run background scripts that are intended to force the host OS to
preempt the guest OSes frequently, but I don't believe that this would
cause that bug.

But it seems like it would make more sense to add this sort of thing to
whatever KVM tests there are for host-side eviction of guest pages.

							Thanx, Paul