Message-ID: <CALCETrWc0wM1x-mAcKCPRUiGtzONtXiNVMFgWZwkRD3v3K3jsA@mail.gmail.com>
Date: Sat, 7 Mar 2020 07:10:47 -0800
From: Andy Lutomirski <luto@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Andy Lutomirski <luto@...nel.org>,
LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
Paolo Bonzini <pbonzini@...hat.com>, KVM <kvm@...r.kernel.org>,
"Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Andy Lutomirski <luto@...nel.org> writes:
> >> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> >> Aside from that, the actual code is a convoluted one-fits-it-all Swiss army
> >> knife. It is invoked from different places with different RCU constraints:
> >>
> >> 1) Host side:
> >>
> >> vcpu_enter_guest()
> >> kvm_x86_ops->handle_exit()
> >> kvm_handle_page_fault()
> >> kvm_async_pf_task_wait()
> >>
> >> The invocation happens from fully preemptible context.
> >
> > I’m a bit baffled as to why the host uses this code at all instead of
> > just sleeping the old fashioned way when the guest takes a (host) page
> > fault. Oh well.
>
> If I can trust the crystal ball which I used to decode this maze then it
> actually makes sense.
>
> Async faults are faults which cannot be handled by the guest, i.e. the
> host either pulled a page out from under the guest or did not populate
> it in the first place.
>
> So the reasoning is that if this happens the guest might be in a
> situation where it can schedule other tasks instead of being stopped
> completely by the host until the page arrives.
>
> Now you could argue that this mostly makes sense for CPL 0 faults, but
> there is definitely code in the kernel where it makes sense as well,
> e.g. exec. But of course as this is designed without a proper handshake
> there is no way for the hypervisor to figure out whether it makes sense
> or not.
>
> If the async fault cannot be delivered to the guest (async PF disabled,
> async PF only enabled for CPL 0, IF == 0) then the host utilizes the
> same data structure and wait mechanism. That really makes sense.
>
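(For illustration only: the delivery conditions listed above amount to
something like the sketch below. The struct and helper names here are
made up, this is not KVM's actual code.)

    #include <stdbool.h>

    /*
     * Hypothetical sketch of the check Thomas describes: an async
     * "page not present" fault may only be injected when the guest
     * enabled async PF, is at CPL 0 (unless it also asked for user
     * mode faults) and currently has IF == 1. Otherwise the host
     * falls back to the shared wait mechanism.
     */
    struct async_pf_guest_state {
            bool    apf_enabled;    /* guest opted in via the async PF MSR */
            bool    apf_user;       /* guest accepts faults at CPL > 0 */
            int     cpl;            /* current privilege level */
            bool    if_set;         /* guest RFLAGS.IF */
    };

    static bool can_inject_async_pf(const struct async_pf_guest_state *s)
    {
            if (!s->apf_enabled)
                    return false;
            if (s->cpl > 0 && !s->apf_user)
                    return false;
            return s->if_set;       /* never inject with interrupts off */
    }
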
> The part which does not make sense in the current implementation is the
> kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
> schedulable and non schedulable wait mechanisms would have avoided all
> the RCU duct tape nonsense and also spared me the brain damage caused by
> reverse engineering this completely undocumented mess.
>
> >> +static void kvm_async_pf_task_wait_halt(u32 token)
> >> +{
> >> +        struct kvm_task_sleep_node n;
> >> +
> >> +        if (!kvm_async_pf_queue_task(token, true, &n))
> >> +                return;
> >>
> >>          for (;;) {
> >> -                if (!n.halted)
> >> -                        prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> >>                  if (hlist_unhashed(&n.link))
> >>                          break;
> >> +                /*
> >> +                 * No point in doing anything about RCU here. Any RCU read
> >> +                 * side critical section or RCU watching section can be
> >> +                 * interrupted by VMEXITs and the host is free to keep the
> >> +                 * vCPU scheduled out as long as it sees fit. This is not
> >> +                 * any different just because of the halt induced voluntary
> >> +                 * VMEXIT.
> >> +                 *
> >> +                 * Also the async page fault could have interrupted any RCU
> >> +                 * watching context, so invoking rcu_irq_exit()/enter()
> >> +                 * around this is not gaining anything.
> >> +                 */
> >> +                native_safe_halt();
> >> +                local_irq_disable();
> >
> > What’s the local_irq_disable() here for? I would believe a
> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> > (Yes, I see you copied this from the old code. It’s still nonsense.)
>
> native_safe_halt() does:
>
>     STI
>     HLT
>
> So the irq disable is required as the loop should exit with interrupts
> disabled.
Oops, should have looked at what native_safe_halt() does.
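For reference, native_safe_halt() is essentially just (leaving aside the
speculation-mitigation details on current kernels):

    /*
     * STI only takes effect after the following instruction, so the
     * HLT executes with interrupts enabled and any pending interrupt
     * (or injected event) wakes the CPU immediately.
     */
    static inline void native_safe_halt(void)
    {
            asm volatile("sti; hlt" : : : "memory");
    }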
>
> > I also find it truly bizarre that hlt actually works in this context.
> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> > pending async pf is satisfied? This would make sense if the wake
> > event were an interrupt, but it’s not according to Paolo.
>
> See above. safe halt enables interrupts, which means IF == 1 and the
> host sanity check for IF == 1 is satisfied.
>
> In fact, if e.g. some regular interrupt arrives before the page becomes
> available and the guest entered the halt loop because the fault happened
> inside an RCU read-side critical section with preemption enabled, then
> the interrupt might wake up another task, set need_resched, and that
> other task can run.
Now I'm confused again. Your patch is very careful not to schedule if
we're in an RCU read-side critical section, but the regular preemption
code (preempt_schedule_irq, etc) seems to be willing to schedule
inside an RCU read-side critical section. Why is the latter okay but
not the async pf case?
Ignoring that, this still seems racy:

    STI
    nested #PF telling us to wake up
    #PF returns
    HLT

Doesn't this result in putting the CPU to sleep for no good reason until
the next interrupt hits?
> > All this being said, the only remotely sane case is when regs->flags
> > has IF==1. Perhaps this code should actually do:
> >
> > WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>
> Yes, that wants to be somewhere early and also cover the async wake
> case. Neither a wake nor a wait can be injected when IF == 0.
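(Something like the below, early in both the wait and the wake paths, is
presumably what you mean -- just a sketch, the helper name and placement
are mine:)

    #include <linux/bug.h>
    #include <linux/ptrace.h>
    #include <asm/processor-flags.h>

    /*
     * Neither an async "page not present" nor a "page ready" event is
     * supposed to be injected while the guest had IF == 0, so complain
     * loudly if regs say otherwise.
     */
    static void kvm_async_pf_check_flags(struct pt_regs *regs)
    {
            WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF));
    }
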
Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.