Message-ID: <CALCETrVomGF-OmWxdaX9axih1kz345rEFop=vZtcKwGR8U-gwQ@mail.gmail.com>
Date:   Fri, 5 Jul 2019 12:06:31 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Andrew Cooper <andrew.cooper3@...rix.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Nadav Amit <namit@...are.com>,
        Ricardo Neri <ricardo.neri-calderon@...ux.intel.com>,
        Stephane Eranian <eranian@...gle.com>,
        Feng Tang <feng.tang@...el.com>,
        Andy Lutomirski <luto@...nel.org>
Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <andrew.cooper3@...rix.com> wrote:
>
> On 04/07/2019 16:51, Thomas Gleixner wrote:
> >   2) The loop termination logic is interesting at best.
> >
> >      If the machine has no TSC or cpu_khz is not known yet it tries 1
> >      million times to ack stale IRR/ISR bits. What?
> >
> >      With TSC it uses the TSC to calculate the loop termination. It takes a
> >      timestamp at entry and terminates the loop when:
> >
> >         (rdtsc() - start_timestamp) >= (cpu_khz << 10)
> >
> >      That's roughly one second.
> >
> >      Both methods are problematic. The APIC has 256 vectors, which means
> >      that in theory max. 256 IRR/ISR bits can be set. In practice this is
> >      impossible as the first 32 vectors are reserved and not affected and
> >      the chance that more than a few bits are set is close to zero.
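
[For reference, the termination logic described above boils down to
roughly the following -- a standalone approximation, with cpu_khz
assumed (3 GHz) rather than calibrated, and iteration counting standing
in for the real loop's acking of stale IRR/ISR bits:]

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned long long cpu_khz = 3000000ULL;  /* assumed, not calibrated */
    unsigned long long start = __rdtsc(), iters = 0;

    /* cpu_khz << 10 is cpu_khz * 1024 ticks, i.e. ~2.4% over one second */
    while (__rdtsc() - start < (cpu_khz << 10))
        iters++;

    printf("~1s timeout after %llu iterations\n", iters);
    return 0;
}
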
>
> [Disclaimer.  I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
>
> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.
>
> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> without an error code on the stack, which results in a corrupt pt_regs
> in the exception handler, and a stack underflow on the way back out,
> most likely with a fault on IRET.
>
> These can be addressed by setting TPR to 0x10, which will inhibit
> delivery of any errant IPIs in this range, but some extra sanity logic
> may not go amiss.  An error code on a 64bit stack can be spotted with
> `testb $8, %spl` due to %rsp being aligned before pushing the exception
> frame.
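
[A minimal kernel-style sketch of that TPR mitigation, assuming the
kernel's apic_read()/apic_write() accessors and the APIC_TASKPRI
register definition -- where in bringup it would belong is the harder
question:]

/* Sketch only: a TPR of 0x10 (priority class 1) makes the LAPIC
 * inhibit delivery of vectors 16-31, which the IDT reserves for
 * exceptions, while leaving vectors 32+ deliverable as usual. */
static void apic_inhibit_reserved_vectors(void)
{
    u32 tpr = apic_read(APIC_TASKPRI);

    apic_write(APIC_TASKPRI, (tpr & ~0xffU) | 0x10);
}
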

Several years ago, I remember having a discussion with someone (Jan
Beulich, maybe?) about how to efficiently make the entry code figure
out the error code situation automatically.  I suspect it was on IRC
and I can't find the logs.  I'm thinking that maybe we should just
make Linux's idtentry code do something like this.

If nothing else, we could make idtentry do:

testb $8, %spl   /* encodes shorter than testl $8, %esp */
jz 1f            /* bit 3 clear: the CPU already pushed an error code */
pushq $-1        /* bit 3 set: no error code, fake orig_ax = -1 */
1:

instead of the current hardcoded push.  The cost of a mispredicted
branch here will be smallish compared to the absurdly large cost of
the entry itself.  But I thought I had something more clever than
this.  This sequence works, but it still feels like it should be
possible to do better:

.macro PUSH_ERROR_IF_NEEDED
    /*
     * Before the IRET frame is pushed, RSP is aligned to a 16-byte
     * boundary.  After SS .. RIP and the error code are pushed, RSP is
     * once again aligned.  Pushing -1 will put -1 in the error code slot
     * (regs->orig_ax) if there was no error code.
     */

    pushq    $-1                /* orig_ax = -1, maybe */
    /* now RSP points to orig_ax (aligned) or di (misaligned) */
    pushq    $0
    /* now RSP points to di (misaligned) or si (aligned) */
    orq    $8, %rsp
    /* now RSP points to di */
    addq    $8, %rsp
    /* now RSP points to orig_ax, and we're in good shape */
.endm

Is there a better sequence for this?
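
[To sanity-check that arithmetic, a throwaway user-space simulation --
fake stack addresses, no kernel dependencies; both entry cases should
converge on the orig_ax slot:]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirror the macro's arithmetic on a simulated RSP value. */
static uint64_t simulate(uint64_t rsp)
{
    rsp -= 8;    /* pushq $-1 */
    rsp -= 8;    /* pushq $0  */
    rsp |= 8;    /* orq   $8, %rsp */
    rsp += 8;    /* addq  $8, %rsp */
    return rsp;
}

int main(void)
{
    uint64_t aligned = 0x7fffffffe000ULL;  /* RSP before the IRET frame */
    uint64_t orig_ax = aligned - 6 * 8;    /* error code / orig_ax slot */

    assert(simulate(aligned - 5 * 8) == orig_ax);  /* no error code */
    assert(simulate(aligned - 6 * 8) == orig_ax);  /* error code pushed */
    puts("both cases land on orig_ax");
    return 0;
}
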

A silly downside here is that the ORC annotations need to know the
offset to the IRET frame.  Josh, how ugly would it be to teach the
unwinder that UNWIND_HINT_IRET_REGS actually means "hey, maybe I'm 8
bytes off -- please realign RSP when doing your calculation"?
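
[Hypothetically, the adjusted hint could amount to no more than this --
a sketch of the proposed semantics, not of the ORC implementation.
Since the hardware IRET frame always starts at an address with bit 3
set (16-byte aligned minus 40), a recorded sp that is still 16-byte
aligned must be pointing at the error code:]

#include <stdint.h>

/* Sketch: given the sp recorded for UNWIND_HINT_IRET_REGS, step over
 * the error code slot if sp is still 16-byte aligned. */
static uint64_t iret_frame_start(uint64_t sp)
{
    return (sp & 8) ? sp : sp + 8;
}
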

FWIW, the entry code is rather silly here in that it actually only
uses the orig_ax slot as a temporary dumping ground for the error code
and then it replaces it with -1 later on.  I don't remember whether
anything still cares about the -1.  Once upon a time, some code
assumed that -1 meant "not in a syscall" and anything else meant "in a
syscall"; if any such code survives, I suspect we should just kill it
regardless.


>
> Another interesting problem is an IPI whose vector is 0x80.  A cunning
> attacker can use this to simulate system calls from unsuspecting
> positions in userspace, or for interrupting kernel context.  At the very
> least the int0x80 path does an unconditional swapgs, so will try to run
> with the user gs, and I expect things will explode quickly from there.

At least SMAP helps here on non-FSGSBASE systems.  With FSGSBASE, I
suppose we could harden this by adding a special check to int $0x80 to
validate GSBASE.
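
[Shape-wise, perhaps something like this -- a sketch only, with the
interesting parts (how the base is read, where the check sits in the
int $0x80 path) hand-waved, but testable in user space:]

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical predicate: a kernel GSBASE should be a kernel-half
 * address (bit 63 set).  Reading the base (RDGSBASE vs. the MSR) is
 * deliberately left out of this sketch. */
static bool gsbase_plausibly_kernel(unsigned long base)
{
    return (long)base < 0;
}

int main(void)
{
    printf("%d\n", gsbase_plausibly_kernel(0xffff888000000000UL)); /* 1 */
    printf("%d\n", gsbase_plausibly_kernel(0x00007f0000000000UL)); /* 0 */
    return 0;
}
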

>
> One option here is to look at ISR and complain if it is found to be set.

Barring some real hackery, we're toast long before we get far enough to do that.
