Date:   Fri, 6 Mar 2020 18:18:28 -0800
From:   Andy Lutomirski <luto@...nel.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, KVM <kvm@...r.kernel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()

> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed
> from a design perspective.
>
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
>
> Aside from that, the actual code is a convoluted one-size-fits-all Swiss
> army knife. It is invoked from different places with different RCU
> constraints:
>
> 1) Host side:
>
>   vcpu_enter_guest()
>     kvm_x86_ops->handle_exit()
>       kvm_handle_page_fault()
>         kvm_async_pf_task_wait()
>
>   The invocation happens from fully preemptible context.

I’m a bit baffled as to why the host uses this code at all instead of
just sleeping the old-fashioned way when the guest takes a (host) page
fault.  Oh well.

> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:    Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +    struct kvm_task_sleep_node n;
> +    DECLARE_SWAITQUEUE(wait);
> +
> +    lockdep_assert_irqs_disabled();
> +
> +    if (!kvm_async_pf_queue_task(token, false, &n))
>      return;
> +
> +    for (;;) {
> +        prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +        if (hlist_unhashed(&n.link))
> +            break;
> +
> +        local_irq_enable();
> +        schedule();
> +        local_irq_disable();
>  }
> +    finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>
> -    n.token = token;
> -    n.cpu = smp_processor_id();
> -    n.halted = is_idle_task(current) ||
> -           (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -            ? preempt_count() > 1 || rcu_preempt_depth()
> -            : interrupt_kernel);
> -    init_swait_queue_head(&n.wq);
> -    hlist_add_head(&n.link, &b->list);
> -    raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +    struct kvm_task_sleep_node n;
> +
> +    if (!kvm_async_pf_queue_task(token, true, &n))
> +        return;
>
>  for (;;) {
> -        if (!n.halted)
> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>      if (hlist_unhashed(&n.link))
>          break;
> +        /*
> +         * No point in doing anything about RCU here. Any RCU read
> +         * side critical section or RCU watching section can be
> +         * interrupted by VMEXITs and the host is free to keep the
> +         * vCPU scheduled out as long as it sees fit. This is not
> +         * any different just because of the halt induced voluntary
> +         * VMEXIT.
> +         *
> +         * Also the async page fault could have interrupted any RCU
> +         * watching context, so invoking rcu_irq_exit()/enter()
> +         * around this is not gaining anything.
> +         */
> +        native_safe_halt();
> +        local_irq_disable();

What’s the local_irq_disable() here for? I would believe a
lockdep_assert_irqs_disabled() somewhere in here would make sense.
(Yes, I see you copied this from the old code. It’s still nonsense.)

I also find it truly bizarre that hlt actually works in this context.
Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
pending async pf is satisfied?  This would make sense if the wake
event were an interrupt, but it’s not according to Paolo.

All this being said, the only remotely sane case is when regs->flags
has IF==1. Perhaps this code should actually do:

WARN_ON(!(regs->flags & X86_EFLAGS_IF));

while (the page isn’t ready) {
	local_irq_enable();
	native_safe_halt();
	local_irq_disable();
}

with some provision to survive the case where the warning fires so we
at least get logs.

In any event, I just sent a patch to disable async page faults that
happen in kernel mode.  Perhaps this patch should actually just have
some warnings to sanity check the async page faults and not even try
to handle all these nasty sub-cases.

Paolo, is there any way to test these async page faults?
