lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210510134503.GA88495@C02TD0UTHF1T.local>
Date:   Mon, 10 May 2021 14:45:03 +0100
From:   Mark Rutland <mark.rutland@....com>
To:     Kees Cook <keescook@...omium.org>
Cc:     linux-hardening@...r.kernel.org, Qing Zhao <qing.zhao@...cle.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Michal Marek <michal.lkml@...kovi.net>,
        linux-kernel@...r.kernel.org, linux-kbuild@...r.kernel.org,
        linux-security-module@...r.kernel.org
Subject: Re: [PATCH] Makefile: Introduce CONFIG_ZERO_CALL_USED_REGS

On Thu, May 06, 2021 at 02:24:18PM -0700, Kees Cook wrote:
> On Thu, May 06, 2021 at 01:54:57PM +0100, Mark Rutland wrote:
> > Hi Kees,
> > 
> > On Wed, May 05, 2021 at 12:18:04PM -0700, Kees Cook wrote:
> > > When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
> > > "-fzero-call-used-regs=used-gpr" (in GCC 11). This option will zero any
> > > caller-used register contents just before returning from a function,
> > > ensuring that temporary values are not leaked beyond the function
> > > boundary. This means that register contents are less likely to be
> > > available for side channel attacks and information exposures.
> > > 
> > > Additionally this helps reduce the number of useful ROP gadgets in the
> > > kernel image by about 20%:
> > > 
> > > $ ROPgadget.py --nosys --nojop --binary vmlinux.stock | tail -n1
> > > Unique gadgets found: 337245
> > > 
> > > $ ROPgadget.py --nosys --nojop --binary vmlinux.zero-call-regs | tail -n1
> > > Unique gadgets found: 267175
> > > 
> > > and more notably removes simple "write-what-where" gadgets:
> > > 
> > > $ ROPgadget.py --ropchain --binary vmlinux.stock | sed -n '/Step 1/,/Step 2/p'
> > > - Step 1 -- Write-what-where gadgets
> > > 
> > >         [+] Gadget found: 0xffffffff8102d76c mov qword ptr [rsi], rdx ; ret
> > >         [+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
> > >         [+] Gadget found: 0xffffffff8104d7c8 pop rdx ; ret
> > >         [-] Can't find the 'xor rdx, rdx' gadget. Try with another 'mov [reg], reg'
> > > 
> > >         [+] Gadget found: 0xffffffff814c2b4c mov qword ptr [rsi], rdi ; ret
> > >         [+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
> > >         [+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
> > >         [-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'
> > > 
> > >         [+] Gadget found: 0xffffffff81540d61 mov qword ptr [rsi], rdi ; pop rbx ; pop rbp ; ret
> > >         [+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
> > >         [+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
> > >         [-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'
> > > 
> > >         [+] Gadget found: 0xffffffff8105341e mov qword ptr [rsi], rax ; ret
> > >         [+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
> > >         [+] Gadget found: 0xffffffff81029a11 pop rax ; ret
> > >         [+] Gadget found: 0xffffffff811f1c3b xor rax, rax ; ret
> > > 
> > > - Step 2 -- Init syscall number gadgets
> > > 
> > > $ ROPgadget.py --ropchain --binary vmlinux.zero* | sed -n '/Step 1/,/Step 2/p'
> > > - Step 1 -- Write-what-where gadgets
> > > 
> > >         [-] Can't find the 'mov qword ptr [r64], r64' gadget
> > > 
> > > In parallel build tests, this has a less than 1% performance impact,
> > > and grows the image size less than 1%:
> > > 
> > > $ size vmlinux.stock vmlinux.zero-call-regs
> > >    text    data     bss     dec     hex filename
> > > 22437676   8559152 14127340 45124168 2b08a48 vmlinux.stock
> > > 22453184   8563248 14110956 45127388 2b096dc vmlinux.zero-call-regs
> > 
> > FWIW, I gave this a go on arm64, and the size increase is a fair bit
> > larger:
> > 
> > | [mark@...rids:~/src/linux]% ls -l Image* 
> > | -rw-r--r-- 1 mark mark 31955456 May  6 13:36 Image.stock
> > | -rw-r--r-- 1 mark mark 33724928 May  6 13:23 Image.zero-call-regs
> > 
> > | [mark@...rids:~/src/linux]% size vmlinux.stock vmlinux.zero-call-regs 
> > |    text    data     bss     dec     hex filename
> > | 20728552        11086474         505540 32320566        1ed2c36 vmlinux.stock
> > | 22500688        11084298         505540 34090526        2082e1e vmlinux.zero-call-regs
> > 
> > The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger
> 
> Woo, that's quite a bit larger! So much so that I struggle to imagine
> the delta. That's almost 1 extra instruction for every 10. 

About 31% of this seems to be due to GCC (almost) always clearing x16
and x17 (see further down for numbers). I suspect that's because GCC has
to assume that any (non-static) functions might be reached via a PLT
which would clobber x16 and x17 with specific values.

We also have a bunch of small functions with multiple returns, where
each return path gets the full complement of zeroing instructions, e.g.

Stock:

| <fpsimd_sync_to_sve>:
|        d503245f        bti     c
|        f9400001        ldr     x1, [x0]
|        7209003f        tst     w1, #0x800000
|        54000040        b.eq    ffff800010014cc4 <fpsimd_sync_to_sve+0x14>  // b.none
|        d65f03c0        ret
|        d503233f        paciasp
|        a9bf7bfd        stp     x29, x30, [sp, #-16]!
|        910003fd        mov     x29, sp
|        97fffdac        bl      ffff800010014380 <fpsimd_to_sve>
|        a8c17bfd        ldp     x29, x30, [sp], #16
|        d50323bf        autiasp
|        d65f03c0        ret

With zero-call-regs:

| <fpsimd_sync_to_sve>:
|        d503245f        bti     c
|        f9400001        ldr     x1, [x0]
|        7209003f        tst     w1, #0x800000
|        540000c0        b.eq    ffff8000100152a8 <fpsimd_sync_to_sve+0x24>  // b.none
|        d2800000        mov     x0, #0x0                        // #0
|        d2800001        mov     x1, #0x0                        // #0
|        d2800010        mov     x16, #0x0                       // #0
|        d2800011        mov     x17, #0x0                       // #0
|        d65f03c0        ret
|        d503233f        paciasp
|        a9bf7bfd        stp     x29, x30, [sp, #-16]!
|        910003fd        mov     x29, sp
|        97fffd17        bl      ffff800010014710 <fpsimd_to_sve>
|        a8c17bfd        ldp     x29, x30, [sp], #16
|        d50323bf        autiasp
|        d2800000        mov     x0, #0x0                        // #0
|        d2800001        mov     x1, #0x0                        // #0
|        d2800010        mov     x16, #0x0                       // #0
|        d2800011        mov     x17, #0x0                       // #0
|        d65f03c0        ret

... where we go from 12 instructions to 20, which is a ~67% bloat.

> I don't imagine functions are that short. There seem to be only r9..r15 as
> call-used.

We have a bunch of cases like the above. Also note that per the AAPCS a
function can clobber x0-17 (and x18 if it's not reserved for something
like SCS), and I see a few places that clobber x1-x17.

> Even if every one was cleared at every function exit (28
> bytes), that implies 63,290 functions, with an average function size of
> 40 instructions?

I generated some (slightly dodgy) numbers by grepping the output of
objdump:

[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | wc -l                                
3979677
[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx[0-9]\+, #0x0' | wc -l 
50070
[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx1[67], #0x0' | wc -l
1

[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | wc -l                                
4422188
[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx[0-9]\+, #0x0' | wc -l
491371
[mark@...rids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx1[67], #0x0' | wc -l 
135729

That's 441301 new MOVs, and the equivalent of 442511 new instructions
overall. There are 135728 new MOVs to x16 and x17 specifically, which
account for ~31% of that.

Overall we go from MOVs being ~1.3% of all instructions to 11%.

> > The resulting Image appears to work, but I haven't done anything beyond
> > booting, and I wasn't able to get ROPgadget.py going to quantify the
> > number of gadgets.
> 
> Does it not like arm64 machine code? I can go check and see if I can get
> numbers...

It's supposed to, and I suspect it works fine, but I wasn't able to get
the tool running at all due to environment problems on my machine.

Thanks,
Mark.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ