lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <877dyqkj3h.fsf@nanos.tec.linutronix.de>
Date:   Wed, 08 Apr 2020 00:48:50 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Vivek Goyal <vgoyal@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        kvm list <kvm@...r.kernel.org>, stable <stable@...r.kernel.org>
Subject: Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

Andy Lutomirski <luto@...capital.net> writes:
>> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>> Andy Lutomirski <luto@...capital.net> writes:
>>> “Page is malfunctioning” is tricky because you *must* deliver the
>>> event. x86’s #MC is not exactly a masterpiece, but it does kind of
>>> work.
>> 
>> Nooooo. This does not need #MC at all. Don't even think about it.
>
> Yessssssssssss.  Please do think about it. :)

I stared too much into that code recently that even thinking about it
hurts. :)

>> The point is that the access to such a page is either happening in user
>> space or in kernel space with a proper exception table fixup.
>> 
>> That means a real #PF is perfectly fine. That can be injected any time
>> and does not have the interrupt semantics of async PF.
>
> The hypervisor has no way to distinguish between
> MOV-and-has-valid-stack-and-extable-entry and
> MOV-definitely-can’t-fault-here.  Or, for that matter,
> MOV-in-do_page_fault()-will-recurve-if-it-faults.

The mechanism which Vivek wants to support has a well defined usage
scenario, i.e. either user space or kernel-valid-stack+extable-entry.

So why do you want to route that through #MC? 

>> So now lets assume we distangled async PF from #PF and made it a regular
>> interrupt, then the following situation still needs to be dealt with:
>> 
>>   guest -> access faults
>> 
>> host -> injects async fault
>> 
>>   guest -> handles and blocks the task
>> 
>> host figures out that the page does not exist anymore and now needs to
>> fixup the situation.
>> 
>> host -> injects async wakeup
>> 
>>   guest -> returns from aysnc PF interrupt and retries the instruction
>>            which faults again.
>> 
>> host -> knows by now that this is a real fault and injects a proper #PF
>> 
>>   guest -> #PF runs and either sends signal to user space or runs
>>            the exception table fixup for a kernel fault.
>
> Or guest blows up because the fault could not be recovered using #PF.

Not for the use case at hand. And for that you really want to use
regular #PF.

The scenario I showed above is perfectly legit:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
   Oh, page is not there, give me some time to figure it out.

   inject async fault

   guest:
        handles async fault interrupt, enables interrupts, blocks

host:
   Situation resolved, shared file was truncated. Tell guest
  
   Inject #MC

   guest:
        handles #MC completely out of context because the faulting
        task is scheduled out.

        Rumage through wait lists (how do you protect them?) and find
        the waiter.

        Schedule irq_work to actually mark the waiter as failed and wake
        it up.

Sorry, but that's not really an improvement. That's worse. 

> I can see two somewhat sane ways to make this work.
>
> 1. Access to bad memory results in an async-page-not-present, except
> that, it’s not deliverable, the guest is killed.

That's incorrect. The proper reaction is a real #PF. Simply because this
is part of the contract of sharing some file backed stuff between host
and guest in a well defined "virtio" scenario and not a random access to
memory which might be there or not.

Look at it from the point where async whatever does not exist at all:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
        suspend guest and resolve situation

        if (page swapped in)
           resume_guest();
        else
           inject_pf();

And this inject_pf() does not care whether it kills the guest or makes
it double/triple fault or whatever.

The 'tell the guest to do something else while host tries to sort it'
opportunistic thingy turns this into:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
        tell guest to do something else, i.e. guest suspends task

        if (page swapped in)
           tell guest to resume suspended task
        else
           tell guest to resume suspended task

   guest resumes and faults again

host:
           inject_pf();

which is pretty much equivalent.

> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s
> an *architectural* turd. By all means, have a nice simple PV mechanism
> to tell the #MC code exactly what went wrong, but keep the overall
> flow the same as in the native case.

It's a completely different flow as you evaluate PV turd instead of
analysing the MCE banks and the other error reporting facilities.

> I think I like #2 much better. It has another nice effect: a good
> implementation will serve as a way to exercise the #MC code without
> needing to muck with EINJ or with whatever magic Tony uses. The
> average kernel developer does not have access to a box with testable
> memory failure reporting.

Yes, because CPU dudes decided that documenting the testability
mechanisms is a bad thing. They surely exist and for Nehalem at least
the ECC error injection mechanism was documented and usable.

But to actually make soemthing useful you need to emulate the bank
interface on the hypervisor and expose the errors to the guest. The MCE
injection mechanism has some awful way to "emulate" MSR reads for
that. See mce_rdmsr(). *SHUDDER*

Thanks,

        tglx

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ