linux-kernel - Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200407224903.GC64635@redhat.com>
Date:   Tue, 7 Apr 2020 18:49:03 -0400
From:   Vivek Goyal <vgoyal@...hat.com>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        kvm list <kvm@...r.kernel.org>, stable <stable@...r.kernel.org>
Subject: Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

On Tue, Apr 07, 2020 at 02:41:02PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> > 
> > Andy Lutomirski <luto@...capital.net> writes:
> >>>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
> >>> Whether interrupts are enabled or not check only happens before we decide
> >>> if async pf protocol should be followed or not. Once we decide to
> >>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check
> >>> if interrupts are enabled or not. And it kind of makes sense otherwise
> >>> guest process will wait infinitely to receive PAGE_READY.
> >>> 
> >>> I modified the code a bit to disable interrupt and wait 10 seconds (after
> >>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf
> >>> got delivered after 10 seconds after enabling interrupts. So error
> >>> async pf was not lost because interrupts were disabled.
> > 
> > Async PF is not the same as a real #PF. It just hijacked the #PF vector
> > because someone thought this is a brilliant idea.
> > 
> >>> Havind said that, I thought disabling interrupts does not mask exceptions.
> >>> So page fault exception should have been delivered even with interrupts
> >>> disabled. Is that correct? May be there was no vm exit/entry during
> >>> those 10 seconds and that's why.
> > 
> > No. Async PF is not a real exception. It has interrupt semantics and it
> > can only be injected when the guest has interrupts enabled. It's bad
> > design.
> > 
> >> My point is that the entire async pf is nonsense. There are two types of events right now:
> >> 
> >> “Page not ready”: normally this isn’t even visible to the guest — the
> >> guest just waits. With async pf, the idea is to try to tell the guest
> >> that a particular instruction would block and the guest should do
> >> something else instead. Sending a normal exception is a poor design,
> >> though: the guest may not expect this instruction to cause an
> >> exception. I think KVM should try to deliver an *interrupt* and, if it
> >> can’t, then just block the guest.
> > 
> > That's pretty much what it does, just that it runs this through #PF and
> > has the checks for interrupts disabled - i.e can't right now' around
> > that. If it can't then KVM schedules the guest out until the situation
> > has been resolved.
> > 
> >> “Page ready”: this is a regular asynchronous notification just like,
> >> say, a virtio completion. It should be an ordinary interrupt.  Some in
> >> memory data structure should indicate which pages are ready.
> >> 
> >> “Page is malfunctioning” is tricky because you *must* deliver the
> >> event. x86’s #MC is not exactly a masterpiece, but it does kind of
> >> work.
> > 
> > Nooooo. This does not need #MC at all. Don't even think about it.
> 
> Yessssssssssss.  Please do think about it. :)
> 
> > 
> > The point is that the access to such a page is either happening in user
> > space or in kernel space with a proper exception table fixup.
> > 
> > That means a real #PF is perfectly fine. That can be injected any time
> > and does not have the interrupt semantics of async PF.
> 
> The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here.  Or, for that matter, MOV-in-do_page_fault()-will-recurve-if-it-faults.
> 
> > 
> > So now lets assume we distangled async PF from #PF and made it a regular
> > interrupt, then the following situation still needs to be dealt with:
> > 
> >   guest -> access faults
> > 
> > host -> injects async fault
> > 
> >   guest -> handles and blocks the task
> > 
> > host figures out that the page does not exist anymore and now needs to
> > fixup the situation.
> > 
> > host -> injects async wakeup
> > 
> >   guest -> returns from aysnc PF interrupt and retries the instruction
> >            which faults again.
> > 
> > host -> knows by now that this is a real fault and injects a proper #PF
> > 
> >   guest -> #PF runs and either sends signal to user space or runs
> >            the exception table fixup for a kernel fault.
> 
> Or guest blows up because the fault could not be recovered using #PF.
> 
> I can see two somewhat sane ways to make this work.
> 
> 1. Access to bad memory results in an async-page-not-present, except that,  it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”.
> 
> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.
> 

Should we differentiate between two error cases. In this case memory is
not bad. Just that page being asked for can't be faulted in anymore. And
another error case is real memory failure. So for the first case it
could be injected into guest using #PF or #VE or something paravirt
specific and real hardware failures take #MC path (if they are not
already doing so).

Thanks
Vivek