linux-kernel - Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87r1y4a3gw.fsf@nanos.tec.linutronix.de>
Date:   Sat, 07 Mar 2020 11:01:19 +0100
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, KVM <kvm@...r.kernel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()

Andy Lutomirski <luto@...nel.org> writes:
>> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>> Aside of that the actual code is a convoluted one fits it all swiss army
>> knife. It is invoked from different places with different RCU constraints:
>>
>> 1) Host side:
>>
>>   vcpu_enter_guest()
>>     kvm_x86_ops->handle_exit()
>>       kvm_handle_page_fault()
>>         kvm_async_pf_task_wait()
>>
>>   The invocation happens from fully preemptible context.
>
> I’m a bit baffled as to why the host uses this code at all instead of
> just sleeping the old fashioned way when the guest takes a (host) page
> fault.  Oh well.

If I can trust the crystal ball which I used to decode this maze then it
actually makes sense.

Aysnc faults are faults which cannot be handled by the guest, i.e. the
host either pulled a page away under the guest or did not populate it in
the first place.

So the reasoning is that if this happens the guest might be in a
situation where it can schedule other tasks instead of being stopped
completely by the host until the page arrives.

Now you could argue that this mostly makes sense for CPL 0 faults, but
there is definitely code in the kernel where it makes sense as well,
e.g. exec. But of course as this is designed without a proper handshake
there is no way for the hypervisor to figure out whether it makes sense
or not.

If the async fault cannot be delivered to the guest (async PF disabled,
async PF only enabled for CPL 0, IF == 0) then the host utilizes the
same data structure and wait mechanism. That really makes sense.

The part which does not make sense in the current implementation is the
kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
schedulable and non schedulable wait mechanisms would have avoided all
the RCU duct tape nonsense and also spared me the brain damage caused by
reverse engineering this completely undocumented mess.

>> +static void kvm_async_pf_task_wait_halt(u32 token)
>> +{
>> +    struct kvm_task_sleep_node n;
>> +
>> +    if (!kvm_async_pf_queue_task(token, true, &n))
>> +        return;
>>
>>  for (;;) {
>> -        if (!n.halted)
>> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>>      if (hlist_unhashed(&n.link))
>>          break;
>> +        /*
>> +         * No point in doing anything about RCU here. Any RCU read
>> +         * side critical section or RCU watching section can be
>> +         * interrupted by VMEXITs and the host is free to keep the
>> +         * vCPU scheduled out as long as it sees fit. This is not
>> +         * any different just because of the halt induced voluntary
>> +         * VMEXIT.
>> +         *
>> +         * Also the async page fault could have interrupted any RCU
>> +         * watching context, so invoking rcu_irq_exit()/enter()
>> +         * around this is not gaining anything.
>> +         */
>> +        native_safe_halt();
>> +        local_irq_disable();
>
> What’s the local_irq_disable() here for? I would believe a
> lockdep_assert_irqs_disabled() somewhere in here would make sense.
> (Yes, I see you copied this from the old code. It’s still nonsense.)

native_safe_halt() does:

         STI
         HLT

So the irq disable is required as the loop should exit with interrupts
disabled.

> I also find it truly bizarre that hlt actually works in this context.
> Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> pending async pf is satisfied?  This would make sense if the wake
> event were an interrupt, but it’s not according to Paolo.

See above. safe halt enables interrupts, which means IF == 1 and the
host sanity check for IF == 1 is satisfied.

In fact, if e.g. some regular interrupt arrives before the page becomes
available and the guest entered the halt loop because the fault happened
inside a RCU read side critical section with preemption enabled, then
the interrupt might wake up another task, set need resched and this
other task can run. At some point the halt waiting task gets back into
the loop and either finds the page ready or goes back into halt.

If the fault hit a preempt disabled region then still the interrupt and
eventual resulting soft interrupts can be handled.

Both scenarios are correct when you actually manage to mentaly
disconnect regular #PF and async #PF completely.

Of course the overloading of regular #PF, the utter lack of
documentation and the crappy and duct taped implementation makes this
really a mind boggling exercise.

> All this being said, the only remotely sane case is when regs->flags
> has IF==1. Perhaps this code should actually do:
>
> WARN_ON(!(regs->flags & X86_EFLAGS_IF));

Yes, that want's to be somewhere early and also cover the async wake
case. Neither wake nor wait can be injected when IF == 0.

> while (the page isn’t ready) {
>  local_irq_enable();
>  native_safe_halt();
>  local_irq_disable();
> }
>
> with some provision to survive the case where the warning fires so we
> at least get logs.

I don't think that any attempt to survive a async #PF injection into a
interrupt disabled region makes sense aside of looking smart and being
uncomprehensible voodoo.

If this ever happens then the host side is completely buggered and all
we can do is warn and pray or warn and die hard.

My personal preference is to warn and die hard.

> In any event, I just sent a patch to disable async page faults that
> happen in kernel mode.

I don't think it's justified. The host side really makes sure that the
guest does have IF == 1 before injecting anything which is not a NMI. If
the guest enables interrupts at the wrong place then this is really not
the hosts problem.

Having a warning in that async pf entry for the IF == 0 injection case
is good enough IMO.

Thanks,

        tglx