Message-ID: <20090601145326.GA28260@liondog.tnic>
Date:	Mon, 1 Jun 2009 16:53:26 +0200
From:	Borislav Petkov <petkovbb@...glemail.com>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Borislav Petkov <borislav.petkov@....com>, greg@...ah.com,
	mingo@...e.hu, norsk5@...oo.com, tglx@...utronix.de,
	mchehab@...hat.com, aris@...hat.com, edt@....ca,
	linux-kernel@...r.kernel.org, randy.dunlap@...cle.com
Subject: Re: [PATCH 0/4] amd64_edac: misc fixes

> Obviously not, since it's a relatively new opcode.  However, it is
> supported by both Intel and AMD with the opcode F3 0F B8 /r.
> 
> The "/r" is the real problem ... it means one can't just mimic it with
> hard-coding .byte directives without fixing the arguments (which means a
> performance hit.)  Furthermore, the 0F B8 opcode is JMPE, which doesn't
> take the same arguments either.

How about we pin the src/dst into a register:

/* f3 48 0f b8 c0 encodes popcnt %rax,%rax: REX.W prefix, ModRM 0xc0 */
#define popcnt_spelled(x)                                       \
({                                                              \
        typeof(x) __ret;                                        \
        __asm__(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t"    \
                ".byte 0xb8\n\t.byte 0xc0\n\t"                  \
                : "=a" (__ret)                                  \
                : "0" (x));                                     \
        __ret;                                                  \
})

which generates

  40055e:       48 8b 45 e8             mov    -0x18(%rbp),%rax
  400562:       f3 48 0f b8 c0          popcnt %rax,%rax
  400567:       48 89 45 f8             mov    %rax,-0x8(%rbp)

here.

For operand sizes below 64 bits, the operands get zero-extended so that
garbage in the high 32/48 bits of %rax doesn't corrupt the result.
We might even want to do the movzwq explicitly so that some compiler
doesn't decide to take a "0f b6" variant with a 16-bit destination,
which leaves the upper bits of %rax untouched. This way, you can popcnt
even single bytes, although the popcnt instruction itself doesn't take
single-byte operands.

  400572:       0f b7 45 f2             movzwl -0xe(%rbp),%eax
  400579:       f3 48 0f b8 c0          popcnt %rax,%rax
  40057e:       66 89 45 f6             mov    %ax,-0xa(%rbp)
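
A minimal sketch of the explicit zero-extension idea, assuming GCC-style
inline asm and a CPU that implements POPCNT. The helper names
(popcnt_u64, popcnt_u16, popcnt_u8) are hypothetical and not part of
the proposed patch. The value is widened through an unsigned 64-bit
parameter before it reaches the asm, so the hard-coded 64-bit popcnt
never sees stale upper bits, whatever move the compiler picks. The
non-x86-64 branch is only a stand-in so the sketch builds elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the argument is already a full 64-bit value, so
 * %rax holds no leftover bits from a narrower operand.
 */
static inline uint64_t popcnt_u64(uint64_t v)
{
#if defined(__x86_64__)
	uint64_t ret;

	/* f3 48 0f b8 c0 = popcnt %rax,%rax; assumes the CPU has POPCNT */
	__asm__(".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc0"
		: "=a" (ret)
		: "0" (v)
		: "cc");
	return ret;
#else
	/* stand-in for other architectures, not the technique itself */
	return (uint64_t)__builtin_popcountll(v);
#endif
}

/* Narrow wrappers: the unsigned parameter type forces zero-extension. */
static inline unsigned int popcnt_u16(uint16_t v)
{
	return (unsigned int)popcnt_u64(v);
}

static inline unsigned int popcnt_u8(uint8_t v)
{
	return (unsigned int)popcnt_u64(v);
}
```

Going through fixed-width unsigned parameters also sidesteps sign
extension, which a plain cast of a signed short to a 64-bit type would
introduce.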


So, in addition to popcnt itself, we have two movs added. This is still
less than the 30+ ops (+ function call overhead) that hweight* get
translated into. I'll redo my kernel build benchmarks tomorrow to get
some more recent numbers on the performance gain.
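
For comparison, here is a sketch of the kind of software bit count that
popcnt would replace. This is one common branch-free variant of the
parallel-sum technique, not a copy of the kernel's hweight64(); the
name sw_hweight64 is made up for illustration.

```c
#include <stdint.h>

/* Parallel bit count: sum bits in 2-, 4-, then 8-bit groups. */
static inline unsigned int sw_hweight64(uint64_t w)
{
	/* each 2-bit field now holds the count of its two bits */
	w -= (w >> 1) & 0x5555555555555555ULL;
	/* each 4-bit field holds the count of its four bits */
	w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	/* each byte holds the count of its eight bits */
	w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* the multiply adds all byte counts into the top byte */
	return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}
```

Even this compact form needs a dozen or so shifts, masks and adds per
call, where popcnt is a single instruction.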

-- 
Regards/Gruss,
    Boris.