lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 25 Oct 2021 16:08:37 +0200
From:   Ard Biesheuvel <ardb@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Frederic Weisbecker <frederic@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        James Morse <james.morse@....com>,
        David Laight <David.Laight@...lab.com>,
        Quentin Perret <qperret@...gle.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Mark Rutland <mark.rutland@....com>
Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Mon, 25 Oct 2021 at 15:57, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Mon, Oct 25, 2021 at 02:21:00PM +0200, Frederic Weisbecker wrote:
>
> > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)                      \
> > +     asm("   .pushsection    .static_call.text, \"ax\"               \n" \
> > +         "   .align          4                                       \n" \
> > +         "   .globl          " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > +         "0: .quad   0x0                                             \n" \
> > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > +         "   hint    34      /* BTI C */                             \n" \
> > +             insn "                                                  \n" \
> > +         "   ldr     x16, 0b                                         \n" \
> > +         "   cbz     x16, 1f                                         \n" \
> > +         "   br      x16                                             \n" \
> > +         "1: ret                                                     \n" \
> > +         "   .popsection                                             \n")
>
> > +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> > +{
> > +     /*
> > +      * -0x8 <literal>
> > +      *  0x0 bti c           <--- trampoline entry point
> > +      *  0x4 <branch or nop>
> > +      *  0x8 ldr x16, <literal>
> > +      *  0xc cbz x16, 20
> > +      * 0x10 br x16
> > +      * 0x14 ret
> > +      */
> > +     struct {
> > +             u64     literal;
> > +             __le32  insn[2];
> > +     } insns;
> > +     u32 insn;
> > +     int ret;
> > +
> > +     insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
> > +     insns.literal = (u64)func;
> > +     insns.insn[0] = cpu_to_le32(insn);
> > +
> > +     if (!func) {
> > +             insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
> > +                                                AARCH64_INSN_BRANCH_RETURN);
> > +     } else {
> > +             insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
> > +                                                AARCH64_INSN_BRANCH_NOLINK);
> > +
> > +             /*
> > +              * Use a NOP if the branch target is out of range, and rely on
> > +              * the indirect call instead.
> > +              */
> > +             if (insn == AARCH64_BREAK_FAULT)
> > +                     insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
> > +     }
> > +     insns.insn[1] = cpu_to_le32(insn);
> > +
> > +     ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));
>
> OK, that's pretty magical...
>
> So you're writing the literal and the two instructions with 2 u64
> stores. Relying on alignment to guarantee both are in a single page and
> that copy_to_kernel_nofault() selects u64 writes.
>

To be honest, it just seemed tidier and less likely to produce weird
corner cases to put the literal and the patched insn in the smallest
possible power-of-2 aligned window, as it ensures that the D-side view
is always consistent.

However, the actual fetch of the instruction could still produce a
stale value before the cache maintenance completes.

> By unconditionally writing the literal, you avoid there ever being an
> stale value, which in turn avoids there being a race where you switch
> from 'J @func' relative addressing to 'NOP; do-literal-thing' and cross
> CPU execution gets the ordering inverted.
>

Indeed.

> Ooohh, but what if you go from !func to NOP.
>
> assuming:
>
>         .literal = 0
>         BTI C
>         RET
>
> Then
>
>         CPU0                    CPU1
>
>         [S] literal = func      [I] NOP
>         [S] insn[1] = NOP       [L] x16 = literal (NULL)
>                                 b x16
>                                 *BANG*
>
> Is that possible? (total lack of memory ordering etc..)
>

The CBZ will branch to the RET instruction if x16 == 0x0, so this
should not happen.

> On IRC you just alluded to the fact that this relies on it all being in
> a single cacheline (i-fetch windows don't need to be cacheline sized,
> but provided they're at least 16 bytes, this should still work given the
> alignment).
>
> But is I$ and D$ coherent? One load is through I-fetch, the other is a
> regular D-fetch.
>
> However, Will has previously expressed reluctance to rely on such
> things.
>

No they are not. That is why the CBZ is there. So the only issue we
might see is where the branch instruction is out of sync with the
literal, and so we may call the old function while switching to the
new one and the I-cache maintenance hasn't completed yet.

> > +     if (!WARN_ON(ret))
> > +             caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
> >  }

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ