Message-ID: <CALCETrWc0wM1x-mAcKCPRUiGtzONtXiNVMFgWZwkRD3v3K3jsA@mail.gmail.com>
Date:   Sat, 7 Mar 2020 07:10:47 -0800
From:   Andy Lutomirski <luto@...nel.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Andy Lutomirski <luto@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, KVM <kvm@...r.kernel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()

On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Andy Lutomirski <luto@...nel.org> writes:
> >> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> >> Aside of that, the actual code is a convoluted one-fits-it-all Swiss army
> >> knife. It is invoked from different places with different RCU constraints:
> >>
> >> 1) Host side:
> >>
> >>   vcpu_enter_guest()
> >>     kvm_x86_ops->handle_exit()
> >>       kvm_handle_page_fault()
> >>         kvm_async_pf_task_wait()
> >>
> >>   The invocation happens from fully preemptible context.
> >
> > I’m a bit baffled as to why the host uses this code at all instead of
> > just sleeping the old-fashioned way when the guest takes a (host) page
> > fault.  Oh well.
>
> If I can trust the crystal ball I used to decode this maze, then it
> actually makes sense.
>
> Async faults are faults which cannot be handled by the guest, i.e. the
> host either pulled the page out from under the guest or did not populate
> it in the first place.
>
> So the reasoning is that if this happens, the guest might be in a
> situation where it can schedule other tasks instead of being stopped
> completely by the host until the page arrives.
>
> Now you could argue that this mostly makes sense for CPL 0 faults, but
> there is definitely code in the kernel where it makes sense as well,
> e.g. exec. But of course, as this is designed without a proper handshake,
> there is no way for the hypervisor to figure out whether it makes sense
> or not.
>
> If the async fault cannot be delivered to the guest (async PF disabled,
> async PF only enabled for CPL 0, IF == 0) then the host utilizes the
> same data structure and wait mechanism. That really makes sense.
>
> The part which does not make sense in the current implementation is the
> kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
> schedulable and non-schedulable wait mechanisms would have avoided all
> the RCU duct-tape nonsense and also spared me the brain damage caused by
> reverse engineering this completely undocumented mess.
>
> >> +static void kvm_async_pf_task_wait_halt(u32 token)
> >> +{
> >> +    struct kvm_task_sleep_node n;
> >> +
> >> +    if (!kvm_async_pf_queue_task(token, true, &n))
> >> +        return;
> >>
> >>      for (;;) {
> >> -        if (!n.halted)
> >> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> >>          if (hlist_unhashed(&n.link))
> >>              break;
> >> +        /*
> >> +         * No point in doing anything about RCU here. Any RCU read
> >> +         * side critical section or RCU watching section can be
> >> +         * interrupted by VMEXITs and the host is free to keep the
> >> +         * vCPU scheduled out as long as it sees fit. This is not
> >> +         * any different just because of the halt induced voluntary
> >> +         * VMEXIT.
> >> +         *
> >> +         * Also the async page fault could have interrupted any RCU
> >> +         * watching context, so invoking rcu_irq_exit()/enter()
> >> +         * around this is not gaining anything.
> >> +         */
> >> +        native_safe_halt();
> >> +        local_irq_disable();
> >
> > What’s the local_irq_disable() here for? I would believe a
> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> > (Yes, I see you copied this from the old code. It’s still nonsense.)
>
> native_safe_halt() does:
>
>          STI
>          HLT
>
> So the local_irq_disable() is required, as the loop must exit with
> interrupts disabled.

Oops, should have looked at what native_safe_halt() does.
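
For reference, the resulting loop shape with native_safe_halt()'s
STI;HLT expansion annotated (a sketch of the quoted patch with comments
added, not the patch itself):

        for (;;) {
                if (hlist_unhashed(&n.link))    /* re-checked with IRQs off */
                        break;
                native_safe_halt();             /* STI; HLT: sleeps with IF == 1 */
                local_irq_disable();            /* restore IF == 0 before the
                                                   next iteration */
        }

so every iteration re-evaluates the exit condition with interrupts
disabled.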

>
> > I also find it truly bizarre that hlt actually works in this context.
> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> > pending async pf is satisfied?  This would make sense if the wake
> > event were an interrupt, but it’s not according to Paolo.
>
> See above. Safe halt enables interrupts, which means IF == 1, so the
> host sanity check for IF == 1 is satisfied.
>
> In fact, if e.g. some regular interrupt arrives before the page becomes
> available and the guest entered the halt loop because the fault happened
> inside an RCU read-side critical section with preemption enabled, then
> the interrupt might wake up another task, set need_resched, and that
> other task can run.

Now I'm confused again.  Your patch is very careful not to schedule if
we're in an RCU read-side critical section, but the regular preemption
code (preempt_schedule_irq, etc.) seems to be willing to schedule
inside an RCU read-side critical section.  Why is the latter okay but
not the async pf case?

Ignoring that, this still seems racy:

STI
nested #PF telling us to wake up
#PF returns
HLT

doesn't this result in putting the CPU to sleep for no good reason until
the next interrupt hits?
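
Annotated, the window being described (a sketch; the wake path
unhashing n.link is taken from the quoted patch):

        STI                     <- IF == 1, the wakeup can be delivered
        #PF (page ready)        <- wake handler runs, unhashes n.link
        IRET                    <- nested fault returns
        HLT                     <- executes anyway; the vCPU sleeps until
                                   the next unrelated interrupt arrives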


> > All this being said, the only remotely sane case is when regs->flags
> > has IF==1. Perhaps this code should actually do:
> >
> > WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>
> Yes, that wants to be somewhere early and also cover the async wake
> case. Neither wake nor wait can be injected when IF == 0.

Sadly, a wrmsr to turn off async PF will inject wakeups even if IF == 0.
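
The disable path meant here is the guest clearing the async-PF control
MSR, roughly (MSR name taken from the in-tree guest code; sketch only):

        wrmsrl(MSR_KVM_ASYNC_PF_EN, 0); /* per the above, the host may
                                           inject pending wakeups here
                                           even though IF == 0 */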
