Date:	Fri, 15 Jul 2011 10:24:49 -0700
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	"H. Peter Anvin" <hpa@...or.com>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...e.hu>,
	the arch/x86 maintainers <x86@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Nick Piggin <npiggin@...nel.dk>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@...rix.com>
Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/24/2011 08:15 PM, H. Peter Anvin wrote:
>>> Could you give us the delta in *compiled* code size?
>> Sure.  Do you mean for the individual lock sequences, or for the
>> overall kernel?
>>
>> J
> Overall is fine.

Here's both ;)

For the NR_CPUS < 256 case, it shrinks the kernel text by a little
over 1k (and a bit of a data reduction too?):

   text	   data	    bss	     dec	   hex	filename
7287009	1915524	2347008	11549541	b03b65	vmlinux-spin-base
7285777	1915412	2347008	11548197	b03625	vmlinux-cleantick

A comparison of spin_lock before:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 6f fa 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
     1: 38 e0                   cmp    %ah,%al
        74 06                   je     2f
        f3 90                   pause
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
     2: 5d                      pop    %rbp
        c3                      retq

and after:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 1f f6 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
        0f b6 d4                movzbl %ah,%edx
     1: 38 d0                   cmp    %dl,%al
        74 06                   je     2f
        f3 90                   pause
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
     2: 5d                      pop    %rbp
        c3                      retq

IOW the generated code is identical except that where the original has

        cmp    %ah,%al

the compiler generates

        movzbl %ah,%edx
        cmp    %dl,%al

I posted a bug about gcc not generating "cmp %ah,%al" in this case,
but the general consensus was that it's not a good choice on current
Intel chips, because it causes a partial-register stall, and that the
currently generated code is better.  (And gcc can generate
"cmp %Xh,%Xl" if it wants to - see below.)

Likewise trylock before:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 50 fa 3d 00          callq  <mcount>
        0f b7 07                movzwl (%rdi),%eax
        38 e0                   cmp    %ah,%al
        8d 90 00 01 00 00       lea    0x100(%rax),%edx
        75 05                   jne    1f
        f0 66 0f b1 17          lock cmpxchg %dx,(%rdi)
     1: 0f 94 c2                sete   %dl
        0f b6 c2                movzbl %dl,%eax
        5d                      pop    %rbp
        c3                      retq

and after:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 fd f5 3d 00          callq  <mcount>
        31 c0                   xor    %eax,%eax
        66 8b 17                mov    (%rdi),%dx
        38 d6                   cmp    %dl,%dh
        75 16                   jne    1f
        8d 8a 00 01 00 00       lea    0x100(%rdx),%ecx
        89 d0                   mov    %edx,%eax
        f0 66 0f b1 0f          lock cmpxchg %cx,(%rdi)
        66 39 d0                cmp    %dx,%ax
        0f 94 c0                sete   %al
        0f b6 c0                movzbl %al,%eax
     1: 5d                      pop    %rbp
        c3                      retq

The differences here are a little more extensive, but the code is
broadly comparable.  Some interesting points on the gcc code:

* it pre-loads the failure return value, so it doesn't need to use
  sete unless it's actually doing the cmpxchg - whether this is good
  or not depends on how often trylock fails vs succeeds, but it's
  pretty minor either way

* it generates a 16-bit load rather than a zero-extending 32-bit load

* it *can* generate a "cmp %dl,%dh" if it wants to

* it ends up re-comparing the "old" and "new" values after cmpxchg
  rather than reusing the flags it sets.  This could be fixed by
  having a cmpxchg() variant which returns a flag rather than the old
  value, which would be generally useful since most cmpxchg() callers
  end up doing that comparison anyway.
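For reference, the C form being compiled above looks roughly like the
sketch below.  This is a minimal standalone rendering for the
NR_CPUS < 256 case (8-bit tickets), not the actual code from the
series: the names (ticketlock_t, ticket_lock, ...) are illustrative,
and GCC's __sync builtins plus an inline-asm pause stand in for the
kernel's own atomics and cpu_relax().

        #include <stdint.h>

        /*
         * Ticket lock layout for NR_CPUS < 256: one byte of head (the
         * ticket now being served) and one of tail (the next ticket
         * to hand out).  On little-endian x86 the low byte is head,
         * so adding 0x100 to the 16-bit word increments tail - the
         * "mov $0x100,%eax; lock xadd %ax,(%rdi)" seen above.
         */
        typedef union {
                uint16_t head_tail;
                struct {
                        uint8_t head;
                        uint8_t tail;
                };
        } ticketlock_t;

        #define cpu_relax()     asm volatile("pause" ::: "memory")

        static inline void ticket_lock(ticketlock_t *lock)
        {
                ticketlock_t old;

                /* atomically take a ticket; xadd returns the old word */
                old.head_tail = __sync_fetch_and_add(&lock->head_tail,
                                                     0x100);

                /* our ticket is the old tail; spin until head reaches it */
                while (old.head != old.tail) {
                        cpu_relax();
                        old.head = *(volatile uint8_t *)&lock->head;
                }
        }

        static inline int ticket_trylock(ticketlock_t *lock)
        {
                ticketlock_t old;

                old.head_tail = *(volatile uint16_t *)&lock->head_tail;

                /* head != tail means the lock is held: fail fast */
                if (old.head != old.tail)
                        return 0;

                /* take a ticket only if the word hasn't changed under us */
                return __sync_bool_compare_and_swap(&lock->head_tail,
                                old.head_tail,
                                (uint16_t)(old.head_tail + 0x100));
        }

Note that __sync_bool_compare_and_swap already returns the flag from
the cmpxchg (it compiles to lock cmpxchg plus sete), i.e. it is
essentially the cmpxchg()-returns-flag variant suggested in the last
bullet above.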
spin_unlock is a little trickier to compare because it's inlined, but
I'm guessing that's the source of the code shrinkage.

Likewise, for NR_CPUS=512, there's a ~900 byte kernel shrinkage with
the new code:

   text	   data	    bss	     dec	   hex	filename
7296571	1945380	2424832	11666783	b2055f	vmlinux-spin-base-big
7295610	1945412	2424832	11665854	b201be	vmlinux-cleantick-big

The generated code for do_raw_spin_lock is even closer:

Before:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 cf 0a 3e 00          callq  <mcount>
        b8 00 00 01 00          mov    $0x10000,%eax
        f0 0f c1 07             lock xadd %eax,(%rdi)
        0f b7 d0                movzwl %ax,%edx
        c1 e8 10                shr    $0x10,%eax
     1: 39 c2                   cmp    %eax,%edx
        74 07                   je     2f
        f3 90                   pause
        0f b7 17                movzwl (%rdi),%edx
        eb f5                   jmp    1b
     2: 5d                      pop    %rbp
        c3                      retq

After:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 63 07 3e 00          callq  <mcount>
        b8 00 00 01 00          mov    $0x10000,%eax
        f0 0f c1 07             lock xadd %eax,(%rdi)
        89 c2                   mov    %eax,%edx
        c1 ea 10                shr    $0x10,%edx
     1: 66 39 d0                cmp    %dx,%ax
        74 07                   je     2f
        f3 90                   pause
        66 8b 07                mov    (%rdi),%ax
        eb f4                   jmp    1b
     2: 5d                      pop    %rbp
        c3                      retq

In other words, identical aside from using 16-bit registers rather
than 32-bit registers with zero-extension.

Trylock:

Before:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 aa 0a 3e 00          callq  <mcount>
        8b 07                   mov    (%rdi),%eax
        89 c2                   mov    %eax,%edx
        c1 c0 10                rol    $0x10,%eax
        39 c2                   cmp    %eax,%edx
        8d 90 00 00 01 00       lea    0x10000(%rax),%edx
        75 04                   jne    1f
        f0 0f b1 17             lock cmpxchg %edx,(%rdi)
     1: 0f 94 c2                sete   %dl
        0f b6 c2                movzbl %dl,%eax
        5d                      pop    %rbp
        c3                      retq

After:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 3e 07 3e 00          callq  <mcount>
        31 c0                   xor    %eax,%eax
        8b 17                   mov    (%rdi),%edx
        89 d1                   mov    %edx,%ecx
        c1 e9 10                shr    $0x10,%ecx
        66 39 ca                cmp    %cx,%dx
        75 14                   jne    1f
        8d 8a 00 00 01 00       lea    0x10000(%rdx),%ecx
        89 d0                   mov    %edx,%eax
        f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
        39 d0                   cmp    %edx,%eax
        0f 94 c0                sete   %al
        0f b6 c0                movzbl %al,%eax
     1: 5d                      pop    %rbp
        c3                      retq

The differences here are similar to the < 256 CPU case:

* gcc generates a straightforward shift and 16-bit compare to compare
  the head and tail, rather than the rol version (which I guess keeps
  everything 32-bit)

* the same pre-loading of the failure return value rather than
  reusing sete

* the same redundant compare after cmpxchg

So, in conclusion:

* overall kernel code size reduction

* the spinlock code is very similar

* the trylock code could be improved a little, but it's unclear to me
  that it would make much difference (since there's a big fat locked
  cmpxchg in the middle of any interesting code path)

* unlock is inlined, so I couldn't evaluate that, but since it's just
  an unlocked inc, it's hard to see how that could go wrong

    J
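A minimal sketch of the unlock alluded to in that last bullet, under
the same illustrative names and layout assumptions as the sketch
earlier in this mail (again, not the kernel's actual code):

        static inline void ticket_unlock(ticketlock_t *lock)
        {
                /*
                 * Only the lock holder writes head, so a plain,
                 * non-"lock"-prefixed increment suffices on x86;
                 * the compiler barrier keeps critical-section
                 * accesses from being reordered past the release.
                 */
                asm volatile("" ::: "memory");
                (*(volatile uint8_t *)&lock->head)++;
        }

Inlined, this boils down to a plain byte increment of head with no
lock prefix, which is why there is so little room for it to go wrong.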