linux-kernel - Re: RFC: userspace exception fixups

Message-Id: <7FF4802E-FBC5-4E6D-A8F6-8A65114F18C7@amacapital.net>
Date:   Tue, 6 Nov 2018 15:00:56 -0800
From:   Andy Lutomirski <luto@...capital.net>
To:     Sean Christopherson <sean.j.christopherson@...el.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...el.com>,
        Jann Horn <jannh@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Rich Felker <dalias@...c.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Jethro Beekman <jethro@...tanix.com>,
        Jarkko Sakkinen <jarkko.sakkinen@...ux.intel.com>,
        Florian Weimer <fweimer@...hat.com>,
        Linux API <linux-api@...r.kernel.org>, X86 ML <x86@...nel.org>,
        linux-arch <linux-arch@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>, nhorman@...hat.com,
        npmccallum@...hat.com, "Ayoun, Serge" <serge.ayoun@...el.com>,
        shay.katz-zamir@...el.com, linux-sgx@...r.kernel.org,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Carlos O'Donell <carlos@...hat.com>,
        adhemerval.zanella@...aro.org
Subject: Re: RFC: userspace exception fixups



>> On Nov 6, 2018, at 1:59 PM, Sean Christopherson <sean.j.christopherson@...el.com> wrote:
>> 
>>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
>>>> On Tue, Nov 6, 2018 at 1:07 PM Andy Lutomirski <luto@...capital.net> wrote:
>>>> 
>>>> 
>>>>> On Nov 6, 2018, at 1:00 PM, Dave Hansen <dave.hansen@...el.com> wrote:
>>>>> 
>>>>> 
>>>>> On 11/6/18 12:12 PM, Andy Lutomirski wrote:
>>>>> True, but what if we have a nasty enclave that writes to memory just
>>>>> below SP *before* decrementing SP?
>>>> Yeah, that would be unfortunate.  If an enclave did this (roughly):
>>>> 
>>>>    1. EENTER
>>>>    2. Hardware sets eenter_hwframe->sp = %sp
>>>>    3. Enclave runs... wants to do out-call
>>>>    4. Enclave sets up parameters:
>>>>        memcpy(&eenter_hwframe->sp[-offset], arg1, size);
>>>>        ...
>>>>    5. Enclave sets eenter_hwframe->sp -= offset
>>>> 
>>>> If we got a signal between 4 and 5, we'd clobber the copy of 'arg1' that
>>>> was on the stack.  The enclave could easily fix this by moving ->sp first.
>>>> 
>>>> But, this is one of those "fun" parts of the ABI that I think we need to
>>>> talk about.  If we do this, we also basically require that the code
>>>> which handles asynchronous exits must *not* write to the stack.  That's
>>>> not hard because it's typically just a single ERESUME instruction, but
>>>> it *is* a requirement.
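
The ordering problem in steps 4 and 5 above can be modelled outside of SGX entirely. What follows is a toy sketch, not enclave or kernel code: `toy_stack`, `deliver_signal()`, `FRAME_SIZE` and both `outcall_*` helpers are made-up names, and the "signal" is just a memset of the bytes immediately below the current stack pointer. It only shows why moving ->sp before writing the arguments keeps them out of the signal frame's way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 256
#define FRAME_SIZE 32	/* hypothetical size of a signal frame */

/* Toy model of the untrusted stack: grows down, sp is an index into mem. */
struct toy_stack {
	unsigned char mem[STACK_SIZE];
	size_t sp;
};

/* Model signal delivery: the frame lands just below the current sp. */
void deliver_signal(struct toy_stack *s)
{
	assert(s->sp >= FRAME_SIZE);
	memset(&s->mem[s->sp - FRAME_SIZE], 0xAA, FRAME_SIZE);
}

/*
 * Risky order: write below sp (step 4), then move sp (step 5).
 * A signal in between overwrites the argument copy.
 * Returns 1 if the argument survived, 0 if it was clobbered.
 */
int outcall_write_then_move(struct toy_stack *s, const void *arg,
			    size_t size, int signal_in_window)
{
	memcpy(&s->mem[s->sp - size], arg, size);
	if (signal_in_window)
		deliver_signal(s);
	s->sp -= size;
	return memcmp(&s->mem[s->sp], arg, size) == 0;
}

/*
 * Safe order: move sp first, then write at or above the new sp.
 * A signal frame now lands below the argument copy.
 */
int outcall_move_then_write(struct toy_stack *s, const void *arg,
			    size_t size, int signal_in_window)
{
	s->sp -= size;
	memcpy(&s->mem[s->sp], arg, size);
	if (signal_in_window)
		deliver_signal(s);
	return memcmp(&s->mem[s->sp], arg, size) == 0;
}
```

In this model only the write-then-move helper loses the argument when the signal lands in the window. The same reasoning is why the AEP code must not write to the stack either: it runs with whatever ->sp the enclave last published.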
>>> I was assuming that the async exit stuff was completely hidden by the
>>> API. The AEP code would decide whether the exit got fixed up by the
>>> kernel (which may or may not be easy to tell: can the code even tell
>>> without kernel help whether it was, say, an IRQ vs #UD?) and then
>>> either do ERESUME or cause sgx_enter_enclave() to return with an
>>> appropriate return value.
>> Sean, how does the current SDK AEX handler decide whether to do
>> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
>> like the *CPU* could give a big hint, but I don't see where there is
>> any architectural indication of why the AEX code got called or any
>> obvious way for the user code to know whether the exit was fixed up by
>> the kernel?
> 
> The SDK "unconditionally" does ERESUME at the AEP location, but that's
> a bit misleading because its signal handler may muck with the context's
> RIP, e.g. to abort the enclave on a fatal fault.
> 
> On an event/exception from within an enclave, the event is immediately
> delivered after loading synthetic state and changing RIP to the AEP.
> In other words, jamming CPU state is essentially a bunch of vectoring
> ucode preamble, but from software's perspective it's a normal event
> that happens to point at the AEP instead of somewhere in the enclave.
> And because the signals the SDK cares about are all synchronous, the
> SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> resides in its signal handler.  IRQs and whatnot simply trampoline back
> into the enclave.
> 
> Userspace can do something funky instead of ERESUME, but only *after*
> IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> case, after the trap handler has run.
> 
> Jumping back a bit, how much do we care about preventing userspace
> from doing stupid things? 

My general feeling is that userspace should be allowed to do apparently stupid things. For example, as far as the kernel is concerned, Wine and DOSEMU are just user programs that do stupid things. Linux generally tries to provide a reasonably complete view of architectural behavior. This is in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may cause very odd behavior indeed. So magic fixups that do non-architectural things are not so great.

The flip side, of course, is that the architecture is arguably inherently erratic here, and it’s apparently impossible to have an SGX library with sane semantics without some kernel assistance.

So if we can make my straw man API work, perhaps with vDSO or rseq-like help, then the official SDK can use it, but less well behaved programs can still mostly work.  (Modulo Linux’s non-support for EINITTOKEN, of course.)

Thinking about it some more, the major sticking point may be finding the RIP and stack frame of EENTER in the AEP code or in its fixup. The vDSO can’t use TLS without serious hackery.  We could massively abuse WRFSBASE, but that’s really ugly.

(How does the Windows case work?  If there’s an exception after the untrusted stack allocation and before EEXIT and SEH tries to handle it, how does the unwinder figure out where to start?)

>  I did a quick POC on the idea of hardcoding
> fixup for the ENCLU opcode, and the basic idea checks out.  The code
> is fairly minimal and doesn't impact the core functionality of the SDK.
> They'd need to redo their trap handling to move it from the signal
> handler to inline, but their stack shenanigans won't be any more broken
> than they already are.
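
For readers trying to picture what "hardcoding fixup for the ENCLU opcode" could mean, here is a rough userspace model. It is not the POC and not kernel code. The only architectural fact it relies on is the ENCLU encoding (0F 01 D7); `toy_regs`, `toy_fixup_enclu()`, and the choice to skip the instruction and report the trap number in `ax` are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ENCLU is encoded as 0F 01 D7. */
static const uint8_t enclu_opcode[3] = { 0x0f, 0x01, 0xd7 };

/* Toy register file; not a real kernel structure. */
struct toy_regs {
	uint64_t ip;	/* offset of the reported fault address in code[] */
	uint64_t ax;	/* hypothetical: receives the trap number on fixup */
};

/*
 * If the reported ip points at an ENCLU instruction (as it would after
 * an AEX, where RIP is the AEP), skip over it and hand the trap number
 * back in ax. Returns 1 if the fixup was applied, 0 if the caller
 * should fall back to ordinary signal delivery.
 */
int toy_fixup_enclu(struct toy_regs *regs, const uint8_t *code,
		    size_t code_len, unsigned int trapnr)
{
	assert(regs != NULL && code != NULL);

	/* Need three readable bytes at ip. */
	if (regs->ip > code_len ||
	    code_len - regs->ip < sizeof(enclu_opcode))
		return 0;

	if (memcmp(code + regs->ip, enclu_opcode, sizeof(enclu_opcode)) != 0)
		return 0;

	regs->ip += sizeof(enclu_opcode);	/* resume after ENCLU */
	regs->ax = trapnr;			/* report why we got here */
	return 1;
}
```

With something along these lines the caller could check the result inline after ENCLU returns instead of in a signal handler, which is the "move it from the signal handler to inline" change mentioned above.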